INSTRUCTION CONVERTING APPARATUS USING PARALLEL EXECUTION CODE

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to an instruction conversion
apparatus, a processor, a storage medium storing parallel
execution codes to which a plurality of instructions have
been assigned, and a computer-readable storage medium
storing an instruction conversion program that generates
such parallel execution codes. In particular, the invention
relates to a technique for decreasing the number of execution
cycles and improving code efficiency by using parallel
processing.

2. Description of the Background Art

In recent years, parallel processing methods have been
widely used in the development of microprocessors. Parallel
processing refers to the execution of a plurality of
instructions in each machine cycle. Examples of classic parallel
processing techniques are superscalar methods and VLIW
(Very Long Instruction Word) methods.

In superscalar methods, specialized circuitry in the
processor dynamically analyzes which instructions can be
executed in parallel and then has these instructions executed
in parallel. These methods have an advantage in that
superscalar processors can be made compatible with serial
processing methods. This means that object code that has been
generated by a compiler for a serial processor can be
executed in its original state by a superscalar processor. A
disadvantage of superscalar techniques is that specialized
hardware needs to be provided in the processor to
dynamically analyze the parallelism of instructions, which leads to
an increase in hardware costs. Another disadvantage is that
the provision of specialized hardware makes it difficult to
raise the operation clock frequency.

In VLIW methods, a plurality of instructions that can be
executed in parallel are arranged into an executable code of
a fixed length, with the instructions in the same executable
code being executed in parallel. For VLIW methods, an
"executable code" is a unit of data that is fetched from
memory in one cycle or is decoded and executed in one
cycle.

For VLIW methods, there is no need during execution for
the processor to analyze which instructions can be executed
in parallel. This means that little hardware is required, and
that raising the operation clock frequency is easy. However,
the use of fixed-length instructions leads to the problems
described below.

In VLIW executable codes, there is a significant variation
in the number of bits required to define different kinds of
instructions. As examples, instructions that deal with a long
constant, such as an address or an immediate, require a large
number of bits, while instructions that perform calculations
using registers may be defined using fewer bits. As stated
above, VLIW methods deal with executable codes of a fixed length,
so that NOP codes need to be inserted into instructions that
only require a low number of bits. This increases code size.

To solve this problem, a technique that fetches a fixed
amount of code from memory in each cycle but decodes and
executes a variable amount of code has been proposed in
recent years. Hereafter, this technique will be referred to as
the "fixed-supply/variable-execution method".

FIG. 1A shows the instruction supply unit used in the
fixed-supply/variable-execution method. Since there is
variation in the number of bits needed to define different
instructions, two different formats are used. Instructions that
require a large number of bits use a first format composed of
two units, units 1 and 2, while instructions that only require
few bits use a second format composed of one unit, unit3.
Here, instructions that have a length of one unit are called
"short instructions", while instructions that have a length of
two units are called "long instructions".

While there are both short and long instructions,
instructions are supplied three units at a time, with no attention
paid to differences in instruction type.

FIG. 1B shows the supply units (hereafter called "packets") for
supplying instructions from memory in each cycle in this
fixed-supply/variable-execution method. FIG. 1C,
meanwhile, shows the minimum units (hereafter called
"execution units") for decoding and execution by this
processor.

During execution, all instructions in an area in FIG. 1B
demarcated by parallel processing boundaries are executed
in parallel in one cycle. This means that in each cycle
instructions are executed in parallel as far as the instruction
that is at the next parallel processing boundary shown in
FIG. 1B using shading. Instructions that have been supplied
but are not executed are accumulated in an instruction buffer
and are executed in a following cycle.

In FIG. 1B, the parallel processing boundary is set at
unit6, so that all units from unit1 to unit6 are set as one
execution unit. Of these units, unit1~unit2, unit3~unit4, and
unit5~unit6 each compose a long instruction, so that these
three long instructions are executed in parallel.

The next parallel processing boundary in FIG. 1B is set at
unit11, so that all units from unit7 to unit11 are executed in
one execution unit. Of these units, unit7~unit8 compose a
long instruction, unit9 composes a short instruction, and
unit10~unit11 compose a long instruction. These three
instructions are executed in parallel.

In this method, instructions are supplied using a
fixed-length packet, and a suitable number of units is issued in
each cycle based on information that is found through static
analysis. Using this method, there is absolutely no need to
insert the no operation instructions (NOP codes) that are
required in conventional VLIW methods with fixed length
instructions. As a result, code size can be reduced.

The following describes the hardware construction of a
processor for this fixed-supply/variable-execution method.

FIG. 2 is a block diagram showing the construction of the
instruction register and periphery in a processor that is
capable of executing three instructions in parallel. The
broken lines in FIG. 2 show the control flows. The unit
queue in FIG. 2 is a sequence of units. These units are
transferred to the instruction registers in the order in which
they were supplied from the instruction memory (or similar).

In this construction, the instruction register A 52a and the
instruction register B 52b form one pair, as do the instruction
register C 52c~the instruction register D 52d and the
instruction register E 52e~the instruction register F 52f.
Instructions are always arranged so as to start from one of the
instruction register A 52a, the instruction register C 52c, and
the instruction register E 52e. Only when an instruction is
formed of two linked units is part of the instruction sent to
the other instruction register in a pair. As a result, when the
unit transferred to the instruction register A 52a is a complete
instruction in itself, no unit is transferred to the instruction
register B 52b.

The main characteristic of the above processor is that
parallel processing can be performed for any combination of
short and long instructions.
When three long instructions are to be executed in
parallel, the three long instructions will be composed of
three pairs unit1~unit2, unit3~unit4, and unit5~unit6 in the
unit queue 50. The present processor stores the first long
instruction in the pair of the instruction register A
52a~instruction register B 52b, the second long instruction
in the pair of the instruction register C 52c~instruction
register D 52d, and the third long instruction in the pair of
the instruction register E 52e~instruction register F 52f.
After being stored in this way, the three long instructions are
executed by the first instruction decoder 53a~third
instruction decoder 53c.

When the three instructions to be executed in parallel are
the long instruction composed of unit1~unit2, the short
instruction composed of unit3, and the long instruction
composed of unit5~unit6, the present processor stores the
first instruction in the pair of the instruction register A
52a~instruction register B 52b, the second instruction in the
instruction register C 52c, and the third instruction in the
pair of the instruction register E 52e~instruction register F
52f. Nothing is stored in the instruction register D 52d. After
being stored in this way, the three instructions are executed
by the first instruction decoder 53a~third instruction decoder
53c.

When unit1~unit2 and unit3~unit4 in the unit queue 50
compose two long instructions and unit5 composes one short
instruction, the present processor stores the first instruction
in the pair of the instruction register A 52a~instruction
register B 52b, the second instruction in the pair of the
instruction register C 52c~instruction register D 52d, and the
third instruction in the instruction register E 52e. Nothing is
stored in the instruction register F 52f. After being stored in
this way, the three instructions are executed by the first
instruction decoder 53a~third instruction decoder 53c.

As should be clear from the above description, there is no
universal definition of the instruction register to which each
unit in the unit queue is to be transferred. There is also no
universal definition of the units in the unit queue that are to
be transferred to each instruction register. For this reason,
the selectors 51a~51d are provided to determine the
destinations of units transferred from the unit queue. These
selectors 51a~51d are controlled in the following way. First,
control is performed to determine the output destination of
selectors 51a and 51b, and the units to be transferred to the
instruction register C 52c~instruction register D 52d are
determined. Once the units to be transferred have been
determined, information regarding the length of the
instruction in the unit transferred to the instruction register C 52c
is examined and control is performed as shown by the
broken lines in FIG. 2 to determine the output destinations
of the selectors 51c and 51d.

While the above processor can decode instructions
regardless of the combination of short and long instructions
and regardless of how the opcodes are located in the units,
the bit width of the input ports for the first~third instruction
decoders 53a~53c is two units, which increases the overall
hardware scale. Putting this another way, the processor is
deficient in having an overly large hardware scale. The
processor includes selectors that switch the output
destinations of the instructions after referring to information
regarding the lengths of the instructions in the units that are
transferred to the instruction registers, so that the hardware
construction becomes increasingly complex as the number
of instructions to be executed in parallel increases.

One conventional method for reducing hardware scale is
that described for the GMICRO/400 processor in the article
"The Approach to Multiple Instruction Execution in the
GMICRO/400 Processor" given in PROCEEDINGS, The
Eighth TRON Project Symposium (International) 1991.

FIG. 3A is a block diagram showing the construction of
the instruction register and periphery for the instruction
issuing control method used by the GMICRO/400 processor.
In FIG. 3A, the broken lines show the control flows. The
constant operands 54a~54b are indicated by the output of
the first instruction decoder~the third instruction decoder.
Each instruction decoder decodes an inputted instruction
and outputs signals to the execution control unit to
control the execution of the instruction, as well as outputting
the constant operands indicated in the instruction.

The instruction issuing control method of the GMICRO/400
processor decodes the combination unit1~unit2, and
unit2 and unit3 separately. After the decoding of the first
instruction decoder has clarified whether the first
instruction is a one-unit instruction or a two-unit instruction, the
selector is controlled so that the decoding result of only
one of the second instruction decoder and the third
instruction decoder is selected and used. As a result, the
processor can execute both instructions in either the short
instruction-short instruction combination or the short
instruction-long instruction combination of FIG. 3B in
parallel.

As shown in FIG. 3A, the GMICRO/400 decreases the
number of instructions that can be executed in parallel from
three to two, so that only two decoders are provided [1]. The
second instruction decoder and the third instruction
decoder also have input ports that are only one unit
wide, so that hardware reductions can be made.

[1] Translator's note: Apparent mistake in the original
Japanese. Three decoders are present.

The above processor has a different problem, however, in
that despite being equipped with three decoders, only two
instructions can be executed in parallel, representing a
marked decrease in parallelism when compared with the
hardware shown in FIG. 2. The second of the two
instructions that can be processed in parallel is also limited to one
unit, giving rise to the further restriction of short instruction-
long instruction combinations also being prohibited.

SUMMARY OF THE INVENTION

It is a primary object of the present invention to provide
a processor that does not need a large hardware scale and can
execute a maximum of s instructions in parallel despite
being equipped with only s decoders. The invention also
aims to provide an instruction conversion apparatus, a
recording medium storing parallel execution codes to which
a plurality of instructions have been assigned, and a
computer-readable recording medium storing an instruction
conversion program that generates such parallel execution
codes.

This primary object can be achieved by an instruction
conversion apparatus that includes an assigning unit for
successively assigning instructions in an instruction
sequence to parallel execution codes and a control unit for
controlling the assigning unit so that a combination of a
plurality of instructions that have already been assigned to a
parallel execution code and an instruction that the assigning
unit is about to assign to the parallel execution code satisfy
predetermined limitations of a target processor.

With the above instruction conversion apparatus, a
plurality of instructions are assigned to a parallel execution
code in keeping with the predetermined limitations of the
processor. Accordingly, the bit width and circuit
constructions of the plurality of decoders that are included in the
decoding unit of the processor can be simplified.

Here, when instructions to be assigned to a parallel
execution code include a long instruction whose word length
is equal to at least two but no more than k unit fields, the
assigning unit may assign one of an opcode and an operand
of the long instruction to a u-th (where u is any integer such
that 1≤u≤s) unit field between the 1st unit field and the s-th
unit field, and only an operand of the long instruction to unit
fields from a (u+1)-th unit field to a (u+k-1)-th unit field.

With the stated construction, when up to s instructions are
arranged into a parallel execution code, the s or fewer
opcodes included in the s or fewer instructions are arranged
without fail into the start of the unit fields between the 1st
unit field and the s-th unit field. Since the s opcodes are
arranged at the start of unit fields, parallel execution of all
of the opcodes included in an executable code will be
possible with only s decoders.
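
The placement rule stated above can be illustrated with a short Python sketch. It assumes, purely for illustration, that a parallel execution code is described by the word lengths (in unit fields) of its instructions in order, and it checks that every instruction starts in one of the first s unit fields, which is where the opcodes are said to be arranged. The function names and the list-of-lengths representation are hypothetical and do not come from the specification.

```python
SHORT, LONG = 1, 2  # word lengths in unit fields (illustrative values, k = 2)


def opcode_fields(lengths):
    """Return the 1-based unit field in which each instruction starts."""
    fields = []
    next_field = 1
    for n in lengths:
        fields.append(next_field)
        next_field += n
    return fields


def fits_s_decoders(lengths, s):
    """True if there are at most s instructions and each one starts
    (i.e. has its opcode) in a unit field between the 1st and the s-th."""
    return len(lengths) <= s and all(f <= s for f in opcode_fields(lengths))


if __name__ == "__main__":
    print(opcode_fields([SHORT, SHORT, LONG]))       # start fields of each instruction
    print(fits_s_decoders([SHORT, SHORT, LONG], 3))  # opcodes in fields 1, 2, 3
    print(fits_s_decoders([LONG, SHORT, SHORT], 3))  # third opcode falls in field 4
```

With s = 3, placing the long instruction last keeps all three opcodes in the first three unit fields, whereas placing it first pushes the third opcode into the fourth unit field.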

Here, the instruction conversion apparatus may also
include a grouping unit for forming an instruction group of
a plurality of instructions that do not exhibit a dependency
relation (hereafter "data dependency relation"), a data
dependency relation being a relation between an instruction
defining a resource and an instruction referring to the same
resource; and a first detecting unit for detecting, when a 1st
to an s-th unit field in a parallel execution code have been
assigned at least one instruction by the assigning unit and
an instruction (hereafter "short instruction") with a shorter
word length than a long instruction is left in the instruction
group, a long instruction assigned to unit fields between the
1st unit field and the s-th unit field, wherein the control unit
may include a first control subunit for controlling the
assigning unit to rearrange instructions that have already
been assigned to the parallel execution code so that the
detected long instruction is assigned to unit fields between
the s-th unit field and the (s+k-1)-th unit field and the short
instruction remaining in the instruction group is assigned to
a unit field between the 1st unit field and the (s-1)-th unit
field.

With the stated construction, all of the opcodes included
in a parallel execution code can be executed in parallel even
when the 1st to s-th unit fields in a parallel execution code are
occupied by a plurality of instructions and a short instruction
is left.
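
The rearrangement described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the apparatus itself: instructions are represented only by their word lengths in unit fields, the instructions in the group are taken to be free of data dependencies (so they may be reordered), and the function name is hypothetical. The sketch moves one already-assigned long instruction to the end so that it begins no later than the s-th unit field, and places the leftover short instruction in front of it.

```python
def rearrange_for_leftover_short(assigned, s, k):
    """Try to fit one leftover short instruction (length 1) into a
    parallel execution code whose instructions have lengths `assigned`.

    Returns the new order of lengths, with the long instruction last,
    or None if no already-assigned long instruction can be moved.
    """
    for i, length in enumerate(assigned):
        if 2 <= length <= k:  # a long instruction was detected
            before = assigned[:i] + assigned[i + 1:] + [1]
            # Everything placed before the long instruction must fit in
            # unit fields 1..(s-1), so the long one starts by field s.
            if sum(before) <= s - 1:
                return before + [length]
    return None


if __name__ == "__main__":
    # s = 3 decoders, long instructions of k = 2 unit fields.
    # Already assigned: long (fields 1-2), short (field 3); one short is left.
    print(rearrange_for_leftover_short([2, 1], s=3, k=2))
```

In the example, the result places the two short instructions in unit fields 1 and 2 and the long instruction in unit fields 3 and 4, so all three opcodes sit within the first three unit fields.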

Here, the instruction group may include instructions that
exhibit an anti-dependence and instructions that exhibit an
output dependence, an anti-dependence being a relation
between an instruction that refers to a resource and an
instruction that thereafter defines the resource, and an output
dependence being a relation between an instruction that
defines a resource and another instruction that defines the
resource, the control unit may include a search unit for
searching for a combination pattern, composed of a plurality
of instructions in the instruction group, that is unaffected by
an anti-dependence and an output dependence, and the
control unit may control the assigning unit to rearrange the
plurality of instructions in accordance with the combination
pattern found by the search unit, to assign the long
instruction found by the first detecting unit to unit fields from the s-th
unit field to the (s+k-1)-th unit field, and to assign a short
instruction left in the instruction group to a unit field
between the 1st unit field and the (s-1)-th unit field.

When there is an instruction in an anti- or an output
dependence with one of the instructions in the instruction
group, such instruction may be assigned to a parallel
execution code to increase the number of instructions executed in
parallel. When doing so, the assigning of instructions in an
order that affects the dependency is prevented beforehand.
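
The three relations defined above (data dependency, anti-dependence, and output dependence) can be expressed compactly in code. The following Python sketch classifies the relations between an earlier and a later instruction from the sets of resources each one defines and refers to. The `Instr` class, the register names, and the mnemonics are hypothetical and are used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    defines: frozenset  # resources written by the instruction
    refers: frozenset   # resources read by the instruction


def dependences(first, second):
    """Return the dependence kinds of `second` on the earlier `first`."""
    kinds = set()
    if first.defines & second.refers:
        kinds.add("data")    # define, then refer to the same resource
    if first.refers & second.defines:
        kinds.add("anti")    # refer, then define the same resource
    if first.defines & second.defines:
        kinds.add("output")  # define, then define the same resource again
    return kinds


if __name__ == "__main__":
    a = Instr("mov", frozenset({"r1"}), frozenset({"r0"}))
    b = Instr("add", frozenset({"r2"}), frozenset({"r1"}))
    c = Instr("ld", frozenset({"r0"}), frozenset({"r3"}))
    print(dependences(a, b))  # b reads r1, which a defines
    print(dependences(a, c))  # c overwrites r0, which a reads
```

Instructions related only by "anti" or "output" are the ones that, according to the text, may still be placed in the same parallel execution code as long as their relative order is preserved.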

Here, the instruction conversion apparatus may also
include: an address resolving unit for assigning a real
address to a parallel execution code; and a second detecting
unit for detecting, when a real address has been assigned
to a parallel execution code, an instruction including the real
address that cannot be expressed by an original word length
of the instruction, a flag setting unit setting the boundary flag
at a unit field located one of before and after unit fields to
which the instruction detected by the second detecting unit
has been assigned.

With the stated construction, processing following the
assignment of instructions to parallel execution codes
converts the parallel execution codes into object codes and
assigns real addresses. When the word length of any of the
instructions needs to be increased, appropriate changes are
made to the parallel execution codes in the converted object
code state. As a result, there is no need to reassign the
plurality of instructions to the parallel execution codes or to
reconvert such parallel execution codes to object codes.
Accordingly, such processing can be performed without
reducing the efficiency of program development.

BRIEF DESCRIPTION OF THE DRAWINGS 

These and other objects, advantages and features of the
invention will become apparent from the following
description thereof taken in conjunction with the accompanying
drawings which illustrate a specific embodiment of the
invention. In the drawings:

FIG. 1A shows a format composed of two units,
unit1~unit2, for instructions that require a large number of
bits, and a format composed of one unit, unit3, for other
instructions;

FIG. 1B shows the unit (packet) of data that is fetched
from memory in one cycle in the fixed-supply/variable-
execution method;

FIG. 1C is a block diagram showing the smallest units that
are decoded and executed by a processor;

FIG. 2 is a block diagram showing the construction of the
instruction register and periphery in a processor that can
execute three instructions in parallel;

FIG. 3A is a block diagram showing the construction of
the instruction register and periphery when the instruction
issuing control method used by the GMICRO/400 is used;

FIG. 3B shows the combinations of instructions that can
be executed in parallel by the hardware shown in FIG. 3A;

FIG. 4 is a block diagram showing the hardware
construction of the processor of the first embodiment;

FIG. 5A shows the amounts of data used when the
instruction fetch unit 21 fetches instructions into the
instruction buffer 22;

FIG. 5B shows the amounts of data used when the
instruction buffer 22 outputs units to the instruction register
23;

FIG. 5C shows how the instruction register 23 issues units
to the decoding unit 30;

FIGS. 6A-6F show the instruction formats used by the
present processor;

FIG. 7 shows the combinations of instructions that can be
decoded by the decoding unit 30;

FIG. 8 shows the detailed construction of the instruction
buffer 22;

FIGS. 9A-9F show the supplying of packets from the
instruction fetch unit 21 to the instruction buffer 22 and the
outputting of units to the instruction register 23;

FIGS. 10A-10F show the supplying of packets from the
instruction fetch unit 21 to the instruction buffer 22 and the
outputting of units to the instruction register 23, though
some of the units are not issued by the instruction register
23;

FIG. 11 is a block diagram showing the construction of
the periphery of the instruction register 23;

FIG. 12 shows the control content of the instruction
issuing control unit 31, and the first instruction decoder
33~third instruction decoder 35 when the instruction pattern
A shown in FIG. 7 is outputted to the first instruction
decoder 33~third instruction decoder 35;

FIG. 13 shows the control content of the instruction
issuing control unit 31, and the first instruction decoder
33~third instruction decoder 35 when the instruction pattern
B shown in FIG. 7 is outputted to the first instruction
decoder 33~third instruction decoder 35;

FIG. 14 shows the control content of the instruction
issuing control unit 31, and the first instruction decoder
33~third instruction decoder 35 when the instruction pattern
C shown in FIG. 7 is outputted to the first instruction
decoder 33~third instruction decoder 35;

FIG. 15 shows the control content of the instruction
issuing control unit 31, and the first instruction decoder
33~third instruction decoder 35 when the instruction pattern
D shown in FIG. 7 is outputted to the first instruction
decoder 33~third instruction decoder 35;

FIG. 16 shows the control content of the instruction
issuing control unit 31, and the first instruction decoder
33~third instruction decoder 35 when the instruction pattern
E shown in FIG. 7 is outputted to the first instruction decoder
33~third instruction decoder 35;

FIG. 17 shows the control content of the instruction
issuing control unit 31, and the first instruction decoder
33~third instruction decoder 35 when the instruction pattern
F shown in FIG. 7 is outputted to the first instruction decoder
33~third instruction decoder 35;

FIG. 18 shows the control content of the instruction
issuing control unit 31, and the first instruction decoder
33~third instruction decoder 35 when the instruction pattern
G shown in FIG. 7 is outputted to the first instruction
decoder 33~third instruction decoder 35;

FIG. 19 shows the control content of the instruction
issuing control unit 31, and the first instruction decoder
33~third instruction decoder 35 when the instruction pattern
H shown in FIG. 7 is outputted to the first instruction
decoder 33~third instruction decoder 35;

FIG. 20 shows the format of parallel execution codes;

FIG. 21 is a block diagram showing the construction of
the instruction conversion apparatus of the present
embodiment and the related data;

FIGS. 22A-22F show examples of assembler codes and
a dependency graph;

FIG. 23A is a flowchart showing the processing of the
instruction rearranging unit 121;

FIG. 23B is a flowchart showing the processing that
judges whether arrangement is possible;

FIG. 24 is a flowchart showing the processing of the
address resolving unit 123 provided inside the linking unit;

FIG. 25 is a flowchart showing an example of a process
that handles a 32-bit constant;

FIG. 26A and FIG. 26B respectively show an example of
the executable codes in a program that has the present
processor execute the processing shown in FIG. 25 and an
execution image;

FIG. 27A shows example assembler codes;

FIG. 27B shows an example dependency graph that
corresponds to FIG. 27A;

FIG. 27C shows the content of the parallel execution
codes;

FIGS. 27D, E show the codes after the addition of parallel
execution boundaries;

FIG. 28A shows example assembler codes;

FIG. 28B shows an example dependency graph that
corresponds to FIG. 28A;

FIG. 28C shows the content of the parallel execution
codes;

FIG. 28D shows the codes after the addition of parallel
execution boundaries;

FIGS. 29A-29B respectively show an example of the
executable codes in a program that has a conventional VLIW
processor with a fixed instruction length of 32 bits execute
the processing shown in FIG. 25 and an execution image;

FIGS. 30A-30B respectively show an example of the
executable codes in a program that has a conventional
processor that executes 32-bit instructions including parallel
execution boundary information execute the processing
shown in FIG. 25 and an execution image; and

FIGS. 31A-31B respectively show an example of the
executable codes in a program that has a conventional
processor that executes 40-bit instructions including parallel
execution boundary information execute the processing
shown in FIG. 25 and an execution image.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

The following describes a processor that is an
embodiment of the present invention, with reference to the
accompanying drawings.

Hardware Construction of the Processor

FIG. 4 is a block diagram showing the hardware
construction of the processor of the first embodiment.
This processor executes a maximum of three instructions
in parallel in one cycle. The hardware of this processor can
be roughly divided into an instruction supplying/issuing unit
20, a decoding unit 30, and an executing unit 40.

The instruction supplying/issuing unit 20 supplies sets of
instructions that it receives from an external memory (not
illustrated) to the decoding unit 30. This instruction
supplying/issuing unit 20 includes an instruction fetch unit
21, an instruction buffer 22, and an instruction register 23.

The instruction fetch unit 21 fetches instruction units
(hereafter "units") from the external memory (not
illustrated) via a 32-bit IA (instruction address) bus and a
64-bit ID (instruction data) bus and stores the fetched units
in an internal instruction cache. The instruction fetch unit 21
also supplies the units at the address outputted by the PC
unit 42 to the instruction buffer 22.

FIG. 5A shows the amounts of data used when the
instruction fetch unit 21 fetches instructions into the
instruction buffer 22. As shown in FIG. 5A, fetching is performed
in 64-bit length blocks (hereafter called "packets") including
three units. The total length of three units is 63 bits, so that
one bit in the 64 bits is left unused.

The instruction buffer 22 has two 64-bit buffers in a
two-stage construction, and accumulates the packets
supplied by the instruction fetch unit 21. The instruction buffer
22 outputs four of the units stored in the two accumulated
packets to the instruction register 23. FIG. 5B shows the
amounts of data used when the instruction buffer 22 outputs
units to the instruction register 23. In FIG. 5B, the top level
shows that the instruction buffer 22 outputs the first four
units unit1, unit2, unit3, and unit4 to the instruction register
23 out of the units unit1~unit6 that were supplied in three-
unit packets in FIG. 5A. The second level shows that the
instruction buffer 22 outputs the next four units unit5, unit6,
unit7, and unit8 to the instruction register 23 out of the units
unit4~unit9 that were supplied in three-unit packets in FIG.
5A.
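
The packet arithmetic described above (three 21-bit units carried in a 64-bit packet, with one bit unused, and four units passed on from two accumulated packets) can be sketched in Python. The bit layout is an assumption made only for this illustration: the text does not say which of the 64 bits is unused, so the sketch leaves the most significant bit unused and packs the first unit into the highest remaining bits. The function names are hypothetical.

```python
UNIT_BITS = 21
UNITS_PER_PACKET = 3
PACKET_BITS = 64
UNIT_MASK = (1 << UNIT_BITS) - 1


def pack_packet(units):
    """Pack three 21-bit units into one packet (assumed layout: MSB unused)."""
    assert len(units) == UNITS_PER_PACKET
    packet = 0
    for u in units:
        assert 0 <= u <= UNIT_MASK
        packet = (packet << UNIT_BITS) | u
    return packet


def unpack_packet(packet):
    """Split a packet back into its three units, first unit first."""
    return [(packet >> (UNIT_BITS * i)) & UNIT_MASK
            for i in reversed(range(UNITS_PER_PACKET))]


def first_four_units(packet_a, packet_b):
    """Units handed to the instruction register from two buffered packets."""
    return (unpack_packet(packet_a) + unpack_packet(packet_b))[:4]


if __name__ == "__main__":
    print(PACKET_BITS - UNITS_PER_PACKET * UNIT_BITS)  # unused bits per packet
    p1, p2 = pack_packet([1, 2, 3]), pack_packet([4, 5, 6])
    print(first_four_units(p1, p2))
```

The example mirrors the top level of FIG. 5B as described: of the six units held in two packets, the first four are output together.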

The instruction register 23 is composed of four 21-bit
registers and stores the four units that are transferred from
the instruction buffer 22. The instruction register 23 issues
up to four of these units to the decoding unit 30. FIG. 5C
shows how the instruction register 23 issues units to the
decoding unit 30. The top level in FIG. 5C shows that the
instruction register 23 first issues unit1 and unit2 to the
decoding unit 30, while the second level shows that the
instruction register 23 next issues unit3~unit6 to the
decoding unit 30. The third level shows that the instruction register
23 then only issues unit7, the fourth level shows that the
instruction register 23 issues unit8~unit10, and the fifth level
shows that the instruction register 23 issues unit11~unit12.
As shown in FIG. 5C, the instruction register 23 issues
between one and four units, out of the four units transferred
from the instruction buffer 22, to the decoding unit 30.

The shaded parts of FIGS. 5A and 5B show the boundaries
(parallel execution boundaries) when units are outputted
from the instruction register 23 to the decoding unit 30.
As can be seen from these parallel execution boundaries, the 
supplying of units from the instruction fetch unit 21 to the 
instruction buffer 22 and the transferring of units from the 
instruction buffer 22 to the instruction register 23 are both 
performed with no relation to the output units used for 
outputting from the instruction register 23 to the decoding 
unit 30. 

The instruction issuing control unit 31 refers to the 
parallel execution boundary information and format infor- 
mation in the units stored in the four registers of the 
instruction register 23 and performs control so that two units 
are treated as one instruction when necessary. The instruc- 
tion issuing control unit 31 also performs control so that the 
issuing of units is not performed beyond a parallel execution 
boundary. 

The following first explains the construction of the
instructions stored in the instruction register 23 and the
storage position of the parallel execution boundary
information f10 and the format information f11.

FIGS. 6A-6F show the instruction formats used by the 
present processor. Each instruction of the present processor 
is composed of a minimum of 21 bits, with there being both 
one-unit instructions that are 21-bit instructions and two-unit
instructions that are 42-bit instructions. The length of each
kind of instruction is decided by the format information f11,
which is one bit long. When the format information f11 is "0",
one unit forms an instruction by itself, while when the
format information f11 is "1", that unit and the following
unit together form one 42-bit instruction.
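
The length rule above can be illustrated with a short sketch. It is not taken from the specification: the function name `group_units` and the choice to represent each unit only by its f11 bit are assumptions made for illustration.

```python
def group_units(f11_bits):
    """Group a sequence of 21-bit units into instructions.

    f11_bits[i] is the format information f11 of unit i (0 or 1).
    The entry for the second unit of a two-unit instruction is skipped,
    because that unit is consumed together with the unit before it.
    Returns a list of (first_unit_index, length_in_units) tuples.
    """
    instructions = []
    i = 0
    while i < len(f11_bits):
        # f11 == 1 means this unit and the following unit form one 42-bit instruction.
        length = 2 if f11_bits[i] == 1 else 1
        if i + length > len(f11_bits):
            raise ValueError("two-unit instruction is missing its second unit")
        instructions.append((i, length))
        i += length
    return instructions
```

For example, `group_units([0, 1, 0, 0])` yields a short instruction at unit 0, a long instruction starting at unit 1, and a short instruction at unit 3.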

The MSB (most significant bit) in each instruction is the 
parallel execution boundary information f10. This parallel
execution boundary information f10 shows whether a parallel
execution boundary is present between the present
instruction and the following instruction. When the parallel
execution boundary information f10 is "1", a parallel execution
boundary is present between this instruction and the
following instruction, while when the parallel execution
boundary information f10 is "0", no parallel execution
boundary is present between this instruction and the following
instruction. If the first to fourth units issued by the
instruction register 23 are divided using the parallel execution
boundary information f10 and the format information
f11, these four units can be decoded as instructions in one of
the patterns A~H shown in FIG. 7. However, due to the
hardware construction of the decoding unit 30, the instructions
of the patterns I and J shown in FIG. 7 cannot be
executed in parallel. This means that if a 21-bit instruction
is called a short instruction and a 42-bit instruction a long
instruction, the following combinations of instructions cannot
be executed.



short-long-long 
long-short-short 
long-short-long 
long-long-short 
long-long-long 


Also note that the instructions in the patterns A~H shown 
in FIG. 7 do not need to be simultaneously executed. When 
instructions cannot be timely supplied, parallel execution 
codes may be divided into two or more parts that are
separately executed. When doing so, the parallel-executable
instructions are processed so that instructions that are closer 
to the MSB are executed in a first cycle and instructions that 
are closer to the LSB (least significant bit) are executed in 
a following cycle. 

The operation of this instruction issuing control unit 31 is
shown in more detail in other drawings. 

The instruction decoder 32 includes a first instruction 
decoder 33, a second instruction decoder 34, and a third 
instruction decoder 35, which each have an input port that is
21 bits wide. These decoders fundamentally decode one
21-bit instruction in one cycle, and send control signals to 
the executing unit 40. These decoders also transfer the 
constant operands that are located in each instruction to the 
data bus 48 of the executing unit 40. 

Aside from the format information f11 and the parallel
execution boundary information f10, FIGS. 6A-6F also
show the operations that are indicated by various kinds of
instructions. FIGS. 6A-6C show the formats of 21-bit
instructions, while FIGS. 6D-6F show the formats of 42-bit
instructions.

In these formats, transfer instructions and arithmetic
instructions that handle long constants, such as 32-bit
constants, and branch instructions that indicate a large
displacement are defined as 42-bit instructions. Most other
kinds of instructions are defined as 21-bit instructions.

These instructions are such that 19 bits may be used in a
21-bit instruction and 40 bits may be used in a 42-bit
instruction. In detail, the format in FIG. 6A includes an
opcode "Op1" that shows the type of operation, an "Rs" field
that shows the register number of the register used as the
source operand, and an "Rd" field that shows the register
number of the register used as the destination operand.

The format in FIG. 6B includes an opcode "Op1" that
shows the type of operation, an "imm5" field that shows a
5-bit immediate used as the source operand, and an "Rd"
field that shows the register number of the register used as
the destination operand.

The format in FIG. 6C includes an opcode "Op2" that
shows the type of operation, and a "disp13" field that shows
a 13-bit immediate used as the source operand.

The "imm5" field indicates a 5-bit constant that is used as
an operand. The "disp13" field indicates a 13-bit displacement.

Each of the instructions shown in FIGS. 6A-6C may be
inputted into one of the first instruction decoder 33~third
instruction decoder 35. The opcode and any register numbers
in an instruction are decoded by the first instruction
decoder 33~third instruction decoder 35, which send control
signals showing the decoding results to the executing unit
40. On the other hand, immediates and displacements are
outputted by the first instruction decoder 33~third instruction
decoder 35 to the executing unit 40 in their
original form.

The following explains the formats of 42-bit instructions. 

The format in FIG. 6D includes an opcode "Op1" that
shows the type of operation, a "disp21" field that shows a
21-bit displacement used as the source operand, and an "Rd"
field that shows the register number of the register used as
the destination operand.

The format in FIG. 6E includes an opcode "Op3" that
shows the type of operation, an "imm32" field that shows a
32-bit immediate used as the source operand, and an "Rd"
field that shows the register number of the register used as
the destination operand.

The format in FIG. 6F includes an opcode "Op1" that
shows the type of operation, and a "disp31" field that shows
a 31-bit displacement used as the source operand.

Since each of the first instruction decoder 33 to third
instruction decoder 35 only has a 21-bit input port, none of
these decoders is able to receive an input of an entire 42-bit
instruction. Accordingly, the first instruction decoder
33~third instruction decoder 35 only receive an input of a
part of a 42-bit instruction shown in FIGS. 6D to 6F as the
20th to 39th bits, which is to say, only the first unit. The
second unit in such an instruction is not inputted into any of
the first instruction decoder 33~third instruction decoder 35
and is instead inputted directly into the executing unit 40
without passing the first instruction decoder 33~third
instruction decoder 35.

This second unit may skip the first instruction decoder 
33~third instruction decoder 35 for the following reason. As 
can be seen from the instruction formats shown in FIGS. 6E 
and 6F, the second of the two units that form a 42-bit 
instruction only includes part of a constant operand. This 
means that the second unit is an instruction format that does 
not include an opcode, so that the second unit does not need 
to be inputted into the first instruction decoder 33~third 
instruction decoder 35. Accordingly, such input can be 
skipped. 

The constant operand of a 42-bit instruction is therefore 
composed by linking a constant in the unit that is outputted
by an instruction decoder with a constant that skips the first 
instruction decoder 33~third instruction decoder 35 and is 
directly transferred to the executing unit 40. 
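
The linking of the two constant parts can be sketched as follows. The text does not state which part holds the high-order bits, so this sketch assumes that the bits carried in the first unit are the high-order part and that the whole 21-bit second unit forms the low-order part. The function name `link_constant` is hypothetical.

```python
UNIT_BITS = 21  # width of one unit, as stated in the text


def link_constant(first_unit_part, second_unit, total_bits):
    """Join the constant bits from the first unit with the 21-bit second unit.

    Assumption (not stated in the text): the bits carried by the first unit
    are the high-order part of the constant.
    """
    high_bits = total_bits - UNIT_BITS
    if high_bits < 0:
        raise ValueError("constant is shorter than one unit")
    if first_unit_part >> high_bits or second_unit >> UNIT_BITS:
        raise ValueError("field value does not fit its width")
    # Place the first-unit bits above the 21 bits that bypass the decoders.
    return (first_unit_part << UNIT_BITS) | second_unit
```

Under this assumption, a 32-bit immediate would be split into 11 bits in the first unit and 21 bits in the second unit, so `link_constant(0x7FF, 0x1FFFFF, 32)` gives `0xFFFFFFFF`.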

The executing unit 40 is a circuit for executing a maxi- 
mum of three units in parallel, based on the control signals 
received from the decoding unit 30. This executing unit 40 
includes an execution control unit 41, a PC unit 42, a register 
file 43, a first calculating unit 44, a second calculating unit 
45, a third calculating unit 46, an operand access unit 47, and 
data buses 48 and 49. 

The execution of instructions is such that units (hereafter 
"execution units") between parallel execution boundaries 
are executed in parallel in one cycle. This means that in each 
cycle, instructions are executed as far as the first instruction 
whose parallel execution boundary information f10 is "1".
Instructions that have been supplied but which are not
executed are accumulated in the instruction buffer and are
executed in a later cycle.
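
As a minimal sketch of this rule, the following function divides a stream of instructions into the groups executed in each cycle. The instruction names and the function name `split_into_cycles` are hypothetical, and each instruction is represented only by a label and its f10 value.

```python
def split_into_cycles(instructions):
    """Split instructions into per-cycle groups using f10.

    instructions: list of (name, f10) pairs in program order.
    Returns (cycles, pending): cycles is a list of groups that end at a
    parallel execution boundary, and pending holds trailing instructions
    whose boundary has not been supplied yet.
    """
    cycles = []
    current = []
    for name, f10 in instructions:
        current.append(name)
        if f10 == 1:
            # A boundary follows this instruction, so the group is complete.
            cycles.append(current)
            current = []
    return cycles, current
```

For the stream `[("a", 0), ("b", 1), ("c", 1), ("d", 0)]`, the first cycle executes `a` and `b`, the second cycle executes `c`, and `d` waits for later instructions.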

The execution control unit 41 is a general name for the
control circuitry and wiring that controls the components
42-49 in the executing unit 40 according to the decoding
results of the decoding unit 30. This execution control unit
41 includes circuits for timing control, execution
permission/prohibition control, status management, and
interrupt control.

The PC (program counter) unit 42 outputs an address in 
the external memory at which a next instruction to be 
decoded and executed is located to the instruction fetch unit 
21 of the instruction supplying/issuing unit 20. 

The register file 43 is composed of thirty-two 32-bit
registers numbered register R0~R31. The values stored in
these registers are transferred to the first calculating unit 44,
the second calculating unit 45, and the third calculating unit
46 via the data bus 48, based on the decoding results of the
first instruction decoder 33, the second instruction decoder
34, and the third instruction decoder 35. The calculating
units perform calculations on the register data or simply
allow the values to pass, before outputting values to the
register file 43 or the operand access unit 47 via the data bus
49.

The first calculating unit 44, the second calculating unit
45, and the third calculating unit 46 each include an ALU
(arithmetic logic unit) and multiplier that perform calculations
on two pieces of 32-bit data, as well as a barrel shifter
that performs shift operations. These calculating units
execute calculations under the control of the execution
control unit 41.

The operand access unit 47 transfers operands between
the register file 43 and the external memory. When, for
example, an instruction has "ld" (load) as its opcode, one
word (32 bits) of data located in the external memory is
loaded into an indicated register in the register file 43 via the
operand access unit 47. When an instruction has "st" (store)
as its opcode, the stored value of an indicated register in the
register file 43 is stored into the external memory.

As shown in FIG. 4, the PC unit 42, the register file 43,
the first calculating unit 44, the second calculating unit 45,
the third calculating unit 46, and the operand access unit 47
are all connected to the data bus 48 (L1 bus, R1 bus, L2 bus,
R2 bus, L3 bus, and R3 bus) and the data bus 49 (D1 bus,
D2 bus, and D3 bus). Note that the L1 bus and R1 bus are
respectively connected to the two input ports of the first
calculating unit 44, the L2 bus and R2 bus are respectively
connected to the two input ports of the second calculating
unit 45, and the L3 bus and R3 bus are respectively
connected to the two input ports of the third calculating unit
46. The D1 bus, D2 bus, and D3 bus are respectively
connected to the outputs of the first calculating unit 44, the
second calculating unit 45, and the third calculating unit 46.

With this architecture, instructions are supplied in packets
of a fixed length, and a suitable number of units for the
degree of parallelism is issued based on statically obtained
information. This method does not require any no operation
(NOP) instructions that are issued in conventional VLIW
methods with fixed-length instructions, so that the overall
code size is reduced.
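
The size reduction can be illustrated with a small calculation. The comparison below is only a sketch under stated assumptions: the conventional VLIW word is taken to have three 21-bit slots padded with NOPs, all instructions are short, and the unused bit in each 64-bit packet is ignored. The function name `code_size_bits` is hypothetical.

```python
UNIT_BITS = 21  # width of one unit, as stated in the text


def code_size_bits(group_sizes, slots_per_word=3):
    """Compare code size for groups of short instructions executed in parallel.

    group_sizes: number of instructions executed together in each cycle.
    Returns (fixed_bits, packed_bits), where fixed_bits pads every group to
    slots_per_word slots with NOPs and packed_bits stores only real units.
    """
    if any(n < 1 or n > slots_per_word for n in group_sizes):
        raise ValueError("each group must hold between 1 and slots_per_word instructions")
    fixed_bits = len(group_sizes) * slots_per_word * UNIT_BITS
    packed_bits = sum(group_sizes) * UNIT_BITS
    return fixed_bits, packed_bits
```

For groups of one, three, and two instructions, the padded form needs 189 bits (including three NOPs), while the boundary-marked form needs 126 bits.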

According to the value of the format information f11, two
units may be executed as one instruction or one unit may be
executed as one instruction. As a result, a long instruction
format is only used for certain instructions that require a
large number of bits, with other instructions being defined
using a short instruction format. This achieves a further
reduction in code size.

Detailed Construction of the Instruction Buffer 
The following describes the instruction buffer 22 in detail.
FIG. 8 shows the detailed construction of the instruction
buffer 22.

The instruction buffer 22 is composed of two 63-bit
buffers, the instruction buffer A221 and the instruction buffer
B222, that each store three units. The instruction buffer
A221 is composed of three 21-bit buffers A0, A1, and A2
that each store one unit. In the same way, the instruction
buffer B222 is composed of three 21-bit buffers B0, B1, and
B2 that each store one unit.

The instruction buffer 22 is supplied with 64-bit packets 
by the instruction fetch unit 21. However, the MSB of the 
packet is not used as information. When a packet is received, 
the 63 valid bits in the packet are stored into one of the 
instruction buffer A221 and the instruction buffer B222 with 
no crossover between the two. The units stored in the 
instruction buffer 22 are stored in the order in which they 
were supplied, with the instruction buffer control unit 223 
managing the status of the instruction buffer 22, such as this 
supplying order and whether either instruction buffer stores 
valid data. 
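
The division of a packet into units can be sketched as follows. The sketch assumes that the three units occupy the 63 valid bits contiguously, with the first unit closest to the MSB. The function name `unpack_packet` is hypothetical.

```python
UNIT_BITS = 21
UNIT_MASK = (1 << UNIT_BITS) - 1


def unpack_packet(packet):
    """Split a 64-bit packet into its three 21-bit units.

    The MSB of the packet carries no information and is discarded.
    Assumption: the units are stored contiguously, first unit nearest the MSB.
    """
    if not 0 <= packet < (1 << 64):
        raise ValueError("packet must be a 64-bit value")
    payload = packet & ((1 << 63) - 1)  # drop the unused MSB
    first = (payload >> (2 * UNIT_BITS)) & UNIT_MASK
    second = (payload >> UNIT_BITS) & UNIT_MASK
    third = payload & UNIT_MASK
    return [first, second, third]
```

Because the MSB is masked off, a packet with that bit set unpacks to the same three units as one with it cleared.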

The instruction buffer control unit 223 assigns a predetermined
transfer order to the six units stored in the instruction
buffer A221 and the instruction buffer B222, and
controls the selectors 224a, 224b, 224c, and 224d so as to
transfer units to the instruction registers A231~D234 in
accordance with this order. This transfer order is determined
based on the order in which packets are transferred from the
instruction fetch unit 21 to the instruction buffer 22 and the
positions of the various units within these packets.

In detail, the packets stored in the instruction buffers A221
and B222 are given a transfer order in accordance with the
order in which they were supplied from the instruction
supplying/issuing unit 20.

The three units in each packet are given a transfer order
that treats the units as a first unit, a second unit, and a third
unit, starting from the unit closest to the MSB. In order
starting from the first unit to be received, units are transferred
from the instruction buffers A221 and B222 to the
instruction registers A231~D234. By assigning this transfer
order to units, a waiting queue is formed using the six units
in the instruction buffers A221 and B222. This waiting
queue is hereafter called the "unit queue".

In this unit queue composed of six units, the first four
units are transferred to the instruction registers A231~D234
as shown in FIG. 5B. After this transfer, the four units may
be issued from the instruction registers A231~D234 to the
first instruction decoder 33~the third instruction decoder 35,
as shown in FIG. 5C. Here, up to four units may be issued,
so that there are cases when units that have not been issued
remain in the instruction registers A231~D234. In such
cases, the instruction buffer control unit 223 invalidates the
units in the instruction registers A231~D234 that have been
issued to the first instruction decoder 33~third instruction
decoder 35 and validates the remaining units. The validated
units are then moved upward in the unit queue.

When a branch occurs, if the branch destination is a unit
that is stored in the unit queue, the branch destination unit
and following units in the unit queue are validated. Units
positioned before the branch destination unit in the unit
queue are invalidated.

This invalidating and moving up of units in the unit queue
is performed based on information showing which units in
the instruction register 23 were not issued to the first
instruction decoder 33~third instruction decoder 35 and on
information showing which units in the instruction buffers
A221 and B222 should be validated. Of these, the former
information is received from the instruction fetch unit 21,
while the latter information is received as feedback from the
instruction issuing control unit 31 of the decoding unit 30.

The following explains the control of buffer states by the
instruction buffer control unit 223 with reference to FIGS.
9A-9F and FIGS. 10A-10F. FIGS. 9A-9F show the supplying
of packets from the instruction fetch unit 21 to the
instruction buffer 22 and the outputting of units to the
instruction register 23. In the same way, FIGS. 10A-10F
show the supplying of packets from the instruction fetch unit
21 to the instruction buffer 22 and the outputting of units to
the instruction register 23, though in FIGS. 10A-10F some
of the units are not issued by the instruction register 23.

FIG. 9A corresponds to when the instruction buffer 22 is
empty and a branch is performed to the second unit in a
packet (unit2). In this case, the packet (composed of unit1,
unit2, and unit3) including this unit2 is supplied from the
instruction fetch unit 21, as shown in FIG. 9B, and is stored
in the instruction buffer A221.

Since the unit at the start of this packet is invalid, the
instruction buffer control unit 223 performs control as
shown in FIG. 9C so that the state of the instruction buffer
22 is that only the buffers A1 and A2 are valid.

If in the next cycle, none of the units transferred from the
instruction buffer 22 to the instruction register 23 is issued
and a valid 64-bit packet composed of unit4, unit5, and unit6
is supplied from the instruction fetch unit 21, the packet is
transferred to the instruction buffer B222, so that the state of
the instruction buffer 22 changes so that buffers A1, A2, B0,
B1, and B2 are all valid.

In the next cycle, there is no space in the instruction buffer
22, as shown in FIG. 9D, so that no supplied packet is
received from the instruction fetch unit 21. Unit2 in buffer
A1, unit3 in buffer A2, unit4 in buffer B0, and unit5 in buffer
B1 are transferred in order to the instruction register 23.

In this way, the supplying of a packet from the instruction
fetch unit 21 is only performed when there is a 63-bit space
in the instruction buffer 22. Packets are managed in the order
in which they were supplied, so that in each cycle, the four
units that were supplied first are transferred from the instruction
buffer 22 to the instruction register 23.

When unit2~unit5 have been issued by the instruction
register 23, all of unit1~unit5 are invalidated as shown in FIG.
9E, resulting in the instruction buffer A221 becoming empty.
As shown in FIG. 9F, this results in unit7~unit9 being
supplied to the instruction buffer A221, so that unit6~unit9
will be stored in the instruction buffer A221 and instruction
buffer B222. In FIG. 10A, these units are transferred to the
instruction register 23. Of these units, unit6~unit8 are issued
by the instruction register 23 to the first instruction decoder
33 and second instruction decoder 34, so that only unit9
remains in the instruction register 23. As a result, all of the
units in the instruction buffer B222 are invalidated, as shown
in FIG. 10B, and all units aside from unit9 in the instruction
buffer A221 are invalidated. This invalidation clears the
instruction buffer B222 so that unit10~unit12 are supplied to
the instruction buffer B222 as shown in FIG. 10C. After this,
four units starting from unit9 (unit9~unit12) are transferred
from the instruction buffer A221 and instruction buffer B222
to the instruction register 23. Of these transferred units,
unit9 and unit10 are issued, while unit11 and unit12 remain
in the instruction register 23. As a result, the instruction
buffer control unit 223 validates only unit11 and unit12 and
invalidates the other units. In the next transfer, three units
starting from unit11 (unit11~unit13) are transferred to the
instruction register 23.
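
The buffer states walked through with FIGS. 9A-10C can be reproduced with a toy model. This is only a sketch: the class name `InstructionBufferModel` and its methods are hypothetical, cycle timing and the selectors are ignored, and a packet is assumed to be accepted whenever one of the two 3-unit buffers holds no valid unit.

```python
from collections import deque


class InstructionBufferModel:
    """Toy model of the two 3-unit buffers (A221 and B222) as a unit queue."""

    def __init__(self):
        # Each entry holds the still-valid units of one packet, oldest first.
        self.packets = deque()

    def has_space(self):
        # A packet is accepted only when one whole 3-unit buffer is free.
        return len(self.packets) < 2

    def supply(self, units, first_valid=0):
        """Store a 3-unit packet; units before index first_valid are invalid."""
        if not self.has_space():
            return False
        valid = list(units[first_valid:])
        if valid:
            self.packets.append(valid)
        return True

    def transfer(self, count=4):
        """Return the first `count` valid units in supply order."""
        queue = [unit for packet in self.packets for unit in packet]
        return queue[:count]

    def retire(self, issued):
        """Invalidate the `issued` oldest units so that later units move up."""
        for _ in range(issued):
            if not self.packets:
                raise ValueError("more units issued than are stored")
            self.packets[0].pop(0)
            if not self.packets[0]:
                self.packets.popleft()
```

Supplying unit1~unit3 with `first_valid=1` and then unit4~unit6 fills the model, so a third packet is refused, and `transfer()` returns units 2 to 5. After `retire(4)`, the packet holding unit7~unit9 is accepted and `transfer()` returns units 6 to 9.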

Periphery of the Instruction Register 23 and Operation of the 
Instruction Issuing Control Unit 31 

The following describes the construction of the periphery 
of the instruction register 23 and the detailed operation of the 
instruction issuing control unit 31. 
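
The generation of the no-operation flags that this section goes on to describe (OR circuits 351 and 352) can be sketched as follows. Each unit is represented by its pair (f10, f11), written in the same order as in the walkthrough of FIGS. 12-19, so that "10" becomes `(1, 0)`. The function name `no_operation_flags` is hypothetical. The extra term that also disables the third decoder when a boundary follows the unit in the instruction register A231 is an assumption made so that pattern A gives the flags stated in the text; it is not part of the description of OR circuit 352.

```python
def no_operation_flags(unit_a, unit_b):
    """Compute the no-operation flags for the second and third decoders.

    unit_a, unit_b: (f10, f11) of the units held in the instruction
    registers A231 and B232, each value being 0 or 1.
    Returns (flag_for_decoder_34, flag_for_decoder_35).
    """
    f10_a, f11_a = unit_a
    f10_b, f11_b = unit_b
    # OR circuit 351: unit in B232 is beyond a boundary or is a second unit.
    nop_second = f10_a | f11_a
    # OR circuit 352: unit in C233 is beyond a boundary or is a second unit.
    nop_third = f10_b | f11_b
    # Assumption: nothing after a boundary at A231 is issued in this cycle.
    nop_third |= f10_a
    return nop_second, nop_third
```

With the values given for patterns C, D, E, and G, this returns `(0, 1)`, `(0, 1)`, `(1, 0)`, and `(0, 0)` respectively, which matches the flags described for those patterns.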
FIG. 11 is a block diagram showing the construction of
the periphery of the instruction register 23. In FIG. 11,
arrows drawn using broken lines indicate control signals.

The instruction register 23 is composed of four 21-bit
registers, the instruction registers A231~D234. For ease of
understanding, this instruction register 23 is shown as setting
a sequence of units supplied by the instruction buffer 22 as
a unit queue.

As shown in FIG. 11, the position in the instruction
register 23 to which a unit is transferred is unequivocally
determined by its position in the unit queue. This means, for
example, that the first unit in the queue will be transferred
to the instruction register A231 and the second unit will be
transferred to the instruction register B232.

The first instruction decoder 33~third instruction decoder
35 each receive an input of a 21-bit unit, decode it, and
output control signals relating to the operation of the instruction
composed by this unit to the execution control unit 41,
as well as outputting any constant operands located in the
unit.

The first instruction decoder 33~third instruction decoder
35 also receive an input of a 1-bit no-operation flag as a
control signal. When this flag is set at "1" for a decoder, the
decoder outputs a no operation instruction. This means that
by setting the no-operation flag, the decoding of an instruction
by an instruction decoder can be invalidated.

The instruction issuing control unit 31 refers to the
parallel execution boundary information f10 and the format
information f11 of the units stored in the instruction register
A231 and the instruction register B232, and judges which is
the final unit that should be outputted from the instruction
register 23 in this cycle. Based on this information, the
instruction issuing control unit 31 outputs control signals
(no-operation instruction flags) that show whether the
decoding by the second instruction decoder 34 and third
instruction decoder 35 should be invalidated. The instruction
issuing control unit 31 then transmits information showing
how many units were not issued and so remain in the
instruction register 23 to the instruction buffer control unit
223 in the instruction buffer 22.

As can be seen from FIG. 11, the units that can be decoded
as instructions are only the units stored in the instruction
register A231, the instruction register B232, and the instruction
register C233. The information in these units is
examined, and decoding is invalidated for units that correspond
to the second unit in a 42-bit instruction and units that
are not issued. A unit that corresponds to the second unit in
a 42-bit instruction is directly outputted as part of the
constant operand of the instruction that is composed by the
preceding unit.

In order to output these control signals, the instruction
issuing control unit 31 is internally equipped with the OR
circuit 351 and the OR circuit 352, as shown in FIG. 11.

The OR circuit 351 invalidates the decoding by the
second instruction decoder 34 if the parallel execution
boundary information f10 of the unit stored in the instruction
register A231 is "1" or if the format information f11 of that
unit is "1".

The OR circuit 352 invalidates the decoding by the third
instruction decoder 35 if the parallel execution boundary
information f10 of the unit stored in the instruction register
B232 is "1" or if the format information f11 of that unit is
"1".

The following explains the operation of the instruction
issuing control unit 31~third instruction decoder 35 when
decoding the instruction patterns A~H shown in FIG. 7, with
reference to FIGS. 12-19.

FIG. 12 shows the control content of the instruction
issuing control unit 31, and the first instruction decoder
33~third instruction decoder 35 when the instruction pattern
A shown in FIG. 7 is outputted to the first instruction
decoder 33~third instruction decoder 35. In this figure, the
parallel execution boundary information f10-format information
f11 of the unit (unit1) stored in the instruction
register A231 is "10". In this case, unit1 forms a 21-bit
instruction, so that decoding of unit2 and unit3 as instructions
is invalidated. This means that the instruction issuing
control unit 31 sets the no-operation flags respectively
outputted to the second instruction decoder 34 and the third
instruction decoder 35 at "1".

FIG. 13 shows the control content of the instruction
issuing control unit 31, and the first instruction decoder
33~third instruction decoder 35 when the instruction pattern
B shown in FIG. 7 is outputted to the first instruction
decoder 33~third instruction decoder 35. In this figure, the
parallel execution boundary information f10-format information
f11 of the unit (unit1) stored in the instruction
register A231 is "01". In this case, unit1 and unit2 stored in
the instruction register B232 together form a 42-bit
instruction, so that unit2 is not decoded as an instruction.
This means that the instruction issuing control unit 31 sets
the no-operation flags respectively outputted to the second
instruction decoder 34 and the third instruction decoder 35
at "1".

FIG. 14 shows the control content of the instruction
issuing control unit 31, and the first instruction decoder
33~third instruction decoder 35 when the instruction pattern
C shown in FIG. 7 is outputted to the first instruction
decoder 33~third instruction decoder 35. In this figure, the
parallel execution boundary information f10-format information
f11 of unit1 stored in the instruction register A231 is
"00", and the parallel execution boundary information f10-
format information f11 of the unit (unit2) stored in the
instruction register B232 is "10". Since the format information
f11 for both units is "0", only units up to unit2 are issued
in this cycle, so that the decoding of unit3 as an instruction
is invalidated. This means that the instruction issuing control
unit 31 sets the no-operation flag outputted to the third
instruction decoder 35 at "1".

FIG. 15 shows the control content of the instruction
issuing control unit 31, and the first instruction decoder
33~third instruction decoder 35 when the instruction pattern
D shown in FIG. 7 is outputted to the first instruction
decoder 33~third instruction decoder 35. In this figure, the
parallel execution boundary information f10-format information
f11 of the unit1 stored in the instruction register
A231 is "00", the parallel execution boundary information
f10-format information f11 of the unit2 stored in the instruction
register B232 is "01", and the parallel execution boundary
information f10-format information f11 of unit3 stored
in the instruction register C233 is "10". In this case, unit1
stored in the instruction register A231 forms a separate
21-bit instruction. Meanwhile, unit2 stored in the instruction
register B232 and unit3 stored in the instruction register
C233 together form a 42-bit instruction, so that the decoding
of unit3 as an instruction is invalidated. This means that the
instruction issuing control unit 31 sets the no-operation flag
outputted to the third instruction decoder 35 at "1".

FIG. 16 shows the control content of the instruction
issuing control unit 31, and the first instruction decoder
33~third instruction decoder 35 when the instruction pattern
E shown in FIG. 7 is outputted to the first instruction decoder
33~third instruction decoder 35. In this figure, the parallel
execution boundary information f10-format information f11
of unit1 stored in the instruction register A231 is "01", the
parallel execution boundary information f10-format information
f11 of the unit2 stored in the instruction register
B232 is "00", and the parallel execution boundary information
f10-format information f11 of unit3 stored in the
instruction register C233 is "10". Since the format information
f11 of unit1 is "1", unit1 and unit2 in the instruction
register B232 together form a 42-bit instruction. On the
other hand, unit3 forms a separate 21-bit instruction and so
needs to be decoded. In this case, the instruction issuing
control unit 31 sets only the no-operation flag outputted to
the second instruction decoder 34 at "1".

FIG. 17 shows the control content of the instruction
issuing control unit 31, and the first instruction decoder
33~third instruction decoder 35 when the instruction pattern
F shown in FIG. 7 is outputted to the first instruction decoder
33~third instruction decoder 35. In this figure, the parallel
execution boundary information f10-format information f11
of unit1 stored in the instruction register A231 is "01", the
parallel execution boundary information f10-format information
f11 of the unit2 stored in the instruction register
B232 is "00", the parallel execution boundary information
f10-format information f11 of unit3 stored in the instruction
register C233 is "01", and the parallel execution boundary
information f10-format information f11 of unit4 stored in the
instruction register D234 is "10". Since the format information
f11 of unit1 is "1", unit1 and unit2 in the instruction
register B232 together form a 42-bit instruction. The format
information f11 of unit3 is also "1" so that unit3 and unit4
in the instruction register D234 together form another 42-bit
instruction. In this case, the instruction issuing control unit
31 sets only the no-operation flag outputted to the second
instruction decoder 34 at "1".

FIG. 18 shows the control content of the instruction
issuing control unit 31, and the first instruction decoder
33~third instruction decoder 35 when the instruction pattern
G shown in FIG. 7 is outputted to the first instruction
decoder 33~third instruction decoder 35. In this figure, the
parallel execution boundary information f10-format information
f11 of unit1 stored in the instruction register A231 is
"00", the parallel execution boundary information f10-
format information f11 of unit2 stored in the instruction
register B232 is "00", and the parallel execution boundary
information f10-format information f11 of unit3 stored in the
instruction register C233 is "10". Since the format information
f11 of unit1 is "0", unit1 stored in the instruction register
A231 forms a separate 21-bit instruction. In the same way,
the format information f11 of unit2 is "0", so that unit2
stored in the instruction register B232 forms a separate
21-bit instruction. Also, the format information f11 of unit3
is "0", so that unit3 stored in the instruction register C233
forms a separate 21-bit instruction. These three 21-bit
instructions are decoded in parallel by the first instruction

stored in the instruction register D234 is "10". Since the
format information f11 of unit1 is "0", unit1 stored in the
instruction register A231 forms a separate 21-bit instruction.
In the same way, the format information f11 of unit2 is "0",
so that unit2 stored in the instruction register B232 forms a
separate 21-bit instruction. On the other hand, the format
information f11 of unit3 is "1", so that together with unit4
in the instruction register D234, unit3 stored in the instruction
register C233 forms a 42-bit instruction. These two
21-bit instructions and single 42-bit instruction are decoded
in parallel by the first instruction decoder 33~third instruction
decoder 35.

As described above, the processor of the present embodiment
can decode up to four units in a sequence of units as
instructions. This means that the patterns A~H shown in
FIG. 7 can be issued, meaning that a maximum of four units
can be issued at once. However, out of the possible patterns
composed of four units, the patterns I~J in FIG. 7 have the
opcode of the third instruction located in the instruction
register D234, so that these instructions cannot be decoded.
However, out of the patterns that include one 42-bit
instruction, even the pattern H in FIG. 7 can be executed in
parallel. This means that even if a processor only has three
decoders with 21-bit input ports, three instructions including
one long instruction can still be executed in parallel.

Second Embodiment

In the processor of the first embodiment, instructions are
supplied using packets that are outputted to the instruction
buffer 22 and instructions are executed using "execution
units" that are outputted from the instruction register 23.
This second embodiment relates to an instruction conversion
apparatus that generates a sequence of packets that are suited
to the processor described in the first embodiment. This
instruction conversion apparatus generates codes that correspond
to the "execution units" described in the first
embodiment, and then converts these codes into the object
codes that correspond to the packets. These codes that
correspond to "execution units" are called "parallel execution
codes" in this second embodiment.

FIG. 20 shows the format of parallel execution codes. In
FIG. 20, the possible sizes of the parallel execution codes are
21 bits, 42 bits, 63 bits, and 84 bits. Here, 84-bit parallel
execution codes can be used to assign the combinations of
short and long instructions shown as patterns F, H, I and J
in FIG. 7, and 63-bit parallel execution codes can be used to
assign the combinations of short and long instructions
shown as patterns D, E, and G in FIG. 7. In the same way,
42-bit parallel execution codes can be used to assign the
combinations of short and long instructions shown as patterns
B and C in FIG. 7, and a 21-bit parallel execution code
can be used to assign one short instruction, as shown by
pattern A in FIG. 7. These parallel execution codes include
internal fields (unit fields) that are each 21 bits in size. One

decoder 33~third instruction decoder 35. 21-bit unit described in the first embodiment can be assigned 

FIG. 19 shows the control content of the instruction 55 to each of these unit fields. The unit fields in parallel 

issuing control unit 31, and the first instruction decoder execution code are assigned numbers starting from the 

33~third instruction decoder 35 when the instruction pattern MSB, and so are respectively called the first, the second, the 

I I shown in FIG. 7 is outputted to the first instruction third, and the fourth unit fields. Of these unit fields, the first 

decoder 33— third instruction decoder 35. In this figure, the to third unit fields can be decoded in order by the first 

parallel execution boundary information flO-format infor- 60 instruction decoder 33~third instruction decoder 35. 

mation fll of unitl stored in the instruction register A231 is When the pattern D in FIG. 7 is assigned, a short 

"00", the parallel execution boundary information flO- instruction is assigned to the first unit field of 63-bit parallel 

format information fll of unit2 stored in the instruction execution code and a long instruction is assigned to the 

register B232 is "00", the parallel execution boundary second and third unit fields in the 63-bit parallel execution 

information flO-format information fll of unit3 stored in the 65 code. When the pattern E in FIG. 7 is assigned, a long 

instruction register C233 is "01", and the parallel execution instruction is assigned to the first and second unit field of 

boundary information flO-format information fll of unil4 63-bit parallel execution code and a short instruction is 
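
As a rough illustration of the unit-field layout described above, the following Python sketch computes the length of a parallel execution code and the unit fields occupied by each instruction from a list of instruction sizes. The function name `layout_unit_fields`, the constant names, and the labels "short" and "long" are hypothetical. The sketch assumes only what the text states: unit fields are 21 bits wide and numbered from the MSB, a short instruction occupies one unit field, and a long instruction occupies two.

```python
UNIT_BITS = 21          # size of one unit field, as stated in the text
MAX_UNIT_FIELDS = 4     # an 84-bit parallel execution code has four unit fields
UNITS_PER_SIZE = {"short": 1, "long": 2}  # hypothetical labels


def layout_unit_fields(sizes):
    """Return (code_length_in_bits, fields) for a list of instruction sizes.

    fields[k] is the index (into `sizes`) of the instruction that occupies
    unit field k + 1, counting unit fields from the MSB.
    """
    fields = []
    for index, size in enumerate(sizes):
        fields.extend([index] * UNITS_PER_SIZE[size])
    if not 1 <= len(fields) <= MAX_UNIT_FIELDS:
        raise ValueError("a parallel execution code holds one to four unit fields")
    return len(fields) * UNIT_BITS, fields


if __name__ == "__main__":
    # Pattern D: a short instruction followed by a long instruction.
    print(layout_unit_fields(["short", "long"]))
    # Pattern E: a long instruction followed by a short instruction.
    print(layout_unit_fields(["long", "short"]))
```

The sketch reproduces only the size arithmetic. It does not check whether a combination is one of the patterns of FIG. 7 or whether the processor can execute it in parallel.
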
When the pattern H in FIG. 7 is assigned, two short instructions are assigned to the first and second unit fields of the 84-bit parallel execution code and a long instruction is assigned to the third and fourth unit fields in the 84-bit parallel execution code.

Note that when two or more instructions are assigned to a parallel execution code, there are cases where parallel execution is not possible. As one example, when the supplying of instructions from the instruction supplying/issuing unit 20 of the processor in the first embodiment cannot keep up with the decoding of instructions by the decoding unit 30, the two or more instructions assigned to the same parallel execution code will be executed in two or more cycles. This means that only an instruction positioned in the first unit field of the parallel execution code is executed in a first cycle, with the instruction positioned in the second unit field of the parallel execution code being executed in the next cycle. Accordingly, the instruction conversion apparatus has to assign short and long instructions to unit fields in such a way that execution will be properly performed even if the plurality of instructions in a set of parallel execution code are executed in two or more cycles.

The setting of the lengths of sets of parallel execution code at 21, 42, 63, or 84 bits can be made by the instruction conversion apparatus setting the parallel execution boundaries shown in the first embodiment in the parallel execution codes. Parallel execution codes that can have one of four lengths are serially arranged, and are then divided into 63-bit lengths. In this way, the packet sequence shown in the first embodiment is obtained as a sequence of object codes.

The parallel execution codes generated in this way must satisfy the two conditions given below.

The first condition is that the plurality of instructions included in a parallel execution code do not violate the restrictions of the processor regarding the available computing resources.

The second condition is that the instructions are assigned within the parallel execution code in accordance with the restrictions on parallel execution by the processor.

The restrictions regarding the instructions that can be arranged between the parallel execution boundaries are as follows.

(1) The total number of instructions in a parallel execution code does not exceed three.

(2) The total number of resources in the processor used by the instructions in a parallel execution code does not exceed three ALUs, one LD/ST unit, and one branch unit.

(3) The combination of instruction sizes in a parallel execution code is one of the patterns A to H shown in FIG. 7.

Construction of the Instruction Conversion Apparatus

The following describes the instruction conversion apparatus of the present embodiment, with reference to the drawings. This instruction conversion apparatus is of a format that is conventionally used in the art, which is to say, a recording medium storing executable software for a compiler and linker that have the equivalent functions of an instruction conversion apparatus. Such recording media are generally distributed and sold as software packages. A user can purchase and install such a software package into a general-purpose computer that can thereafter function as an instruction conversion apparatus simply by processing according to the installed software. Since this is the common method for implementing an instruction conversion apparatus, the software for achieving an instruction conversion apparatus is more important than the hardware resources, such as the processor and memory, of the general-purpose computer on which the software is run. Software that has such complicated processing content is generally composed of a number of subroutines and work areas, so that each of these subroutines and work areas should be considered a separate construction element. However, it is common for such subroutines and work areas to be arranged into a library by a conventional operating system or compiler, and such components will not be explained here. Accordingly, the following explanation will focus on the functions of the subroutines and work areas that are required to achieve the functions of an instruction conversion apparatus.

FIG. 21 is a block diagram showing the construction of the instruction conversion apparatus of the present embodiment and related data.

The construction of the present instruction conversion apparatus can be broadly divided into the following two groups. The first group generates object codes 160 from source codes 150 that are written in a high-level language, comprises the compiler upstream part 110, the assembler code generating unit 111, the instruction scheduling unit 112, and the object code generating unit 113, and corresponds to a conventional compiler. The second group links a plurality of object codes 160 and generates the final executable codes 170, comprises the linking unit 114, and corresponds to a conventional linker.

Compiler Upstream Part 110

The compiler upstream part 110 reads the source program 150 that is stored in a file. The source program 150 is written in a high-level language, so that the compiler upstream part 110 performs a syntactic and semantic analysis on the source program 150 and generates internal representation codes and an internal representation program composed of a plurality of internal representation codes. The compiler upstream part 110 also optimizes this internal representation program as necessary to reduce the code size and/or execution time of the executable codes that are finally generated.

Assembler Code Generating Unit 111

The assembler code generating unit 111 generates assembler codes from the internal representation codes that have been generated and optimized by the compiler upstream part 110, and by doing so generates an assembler program composed of a plurality of assembler codes.

The processing of the compiler upstream part 110 and assembler code generating unit 111 does not relate to the gist of the present invention and may be achieved through the processing performed by a conventional instruction conversion apparatus. Accordingly, such processing will not be described in this specification. When assembler codes are generated, it is assumed that it is possible to judge whether the assembler codes correspond to long instructions or short instructions. Note that assembler codes that include a displacement as an operand are provisionally assumed to be short instructions at this stage.

Instruction Scheduling Unit 112

The instruction scheduling unit 112 analyzes dependencies between instructions in the assembler codes generated by the assembler code generating unit 111, performs instruction scheduling (reordering of instructions), and adds parallel execution boundaries, assigning assembler codes that can be executed in parallel to a same parallel execution code. When doing so, the instruction scheduling unit 112 also considers the case where instructions assigned to a same parallel execution code are executed separately in two cycles, and assigns instructions to unit fields so as to ensure
that there will be no breakdown in the dependencies even if the instructions are executed in different cycles. To perform such assigning, the instruction scheduling unit 112 includes a dependency analyzing unit 120, an instruction rearranging unit 121, and a parallel execution boundary appending unit 122. To simplify the explanation, the instruction scheduling unit 112 is assumed here to process the assembler codes in basic block units.

The dependency analyzing unit 120 analyzes the dependencies between instructions in a basic block and produces a dependency graph. In this specification, there are the following three types of dependencies between instructions:

data dependence: dependency between an instruction that defines a resource and an instruction that refers to the same resource;

anti-dependence: dependency between an instruction that refers to a resource and an instruction that defines the same resource; and

output dependence: dependency between an instruction that defines a resource and another instruction that defines the same resource.

Rearranging the original order of instructions so that instructions that exhibit any of the above types of dependencies are interchanged will affect the meaning of the program. Accordingly, such dependencies need to be maintained when rearranging the instructions.

The dependency analyzing unit 120 refers to the result of its analysis, generates a node for each instruction that is included in a basic block, and generates edges (arrows) joining pairs of instructions that exhibit a dependency. As one example, FIG. 22B shows a dependency graph that corresponds to the assembler codes shown in FIG. 22A. In FIG. 22A, instruction1 "ld (mem1),R0" and instruction2 "add 1,R0" have a data dependency regarding register R0. In the same way, instruction2 "add 1,R0" and instruction3 "st R0,(mem2)" have a data dependency regarding register R0. Instruction3 "st R0,(mem2)" and instruction4 "mov R1,R0" have an anti-dependence regarding register R0. In the same way, instruction4 "mov R1,R0" and instruction6 "add R3,R0" have a data dependency regarding register R0, instruction5 "mov R2,R3" and instruction6 "add R3,R0" have a data dependency regarding register R3, and instruction6 "add R3,R0" and instruction7 "st R0,(mem3)" have a data dependency regarding register R0.

Instructions that exhibit a data dependency are joined in FIG. 22B by solid lines, while instructions that exhibit an anti-dependence or an output dependence are joined by broken lines. In FIG. 22B, instruction4 "mov R1,R0", instruction5 "mov R2,R3", and instruction6 "add R3,R0" are joined in a Y shape, with instruction4 "mov R1,R0" being further joined by a broken line to instruction3 "st R0,(mem2)". In this dependency graph, the arrows are interpreted as the output order that should be respected when issuing instructions from the instruction registers A231-D234 to the instruction decoders 33-35.

A dependency graph may be generated according to a conventional method, such as that disclosed in the paper "Instruction Scheduling in the TOBEY compiler" (R. J. Blainey, IBM J. Res. Develop., Vol. 38, No. 5, September 1994).

The instruction rearranging unit 121 refers to the dependency graph generated by the dependency analyzing unit 120 and rearranges the instructions in a basic block, assigning one or more instructions to each parallel execution code. This rearranging by the instruction rearranging unit 121 is analogous to a game where branches are cut off a tree. FIGS. 22A-22F show the procedure of this branch-cutting game.

In the game, the dependency graph generated by the dependency analyzing unit 120 is considered to be a tree whose branches are combinations of nodes and edges. Nodes that are indicated by an edge but do not themselves indicate any other edges (nodes 1, 5, and 8 in FIG. 22C) are considered to be the end branches.

In FIG. 22D, the player selects node 1 out of the end branches and cuts off this node. Once node 1 has been removed, node 2 becomes an end branch, so that the player next selects and cuts off one node out of the end branches, nodes 2, 5, and 8. In FIG. 22E, the player selects node 8 out of the end branches and cuts off this node.

The player continues to cut off branches as described above, with the nodes in the cut-off branches being arranged into a parallel execution code in the order in which the nodes are cut off. An arrangement of parallel execution codes that respects the dependencies in the program is obtained when all of the branches have been cut off the tree. The lower the number of parallel execution codes, the higher the score of the player (which is to say, the better the parallel execution codes). This completes the description of the branch-cutting game as an analogy to the procedure for rearranging nodes.

The instruction rearranging unit 121 performs this rearranging in accordance with the procedure in the flowchart shown in FIG. 23A. In this explanation, the expression "arranging" refers to the processing that assigns up to three instructions in the four unit fields in a parallel execution code. An arrangement of instructions whose assignment to a parallel execution code may be changed is called a provisional arrangement, while an arrangement that will not be changed is called a definite arrangement.

The expression "arrangement candidate" refers to a node that corresponds to an end branch in the branch-cutting game described above, which can be a node that has no predecessors or a node whose predecessors have all been provisionally arranged. The nodes in the dependency graph that are currently arrangement candidates change as the process of arranging instructions into parallel execution codes progresses.

The following explanation describes each step in the arrangement process. In step S0, the instruction rearranging unit 121 sets the variable i at "1". This variable i indicates one of the parallel execution codes included in the object program that will be generated by the processing hereafter. In this example, each parallel execution code has an initial length of 84 bits. The following step, step S1, forms a loop process (loop1) together with step S10. As a result, the processing in steps S2-S9 is repeated for each node in the dependency graph generated by the dependency analyzing unit 120.

In step S2, the instruction rearranging unit 121 extracts all nodes that are assignment candidates for a present parallel execution code from the dependency graph and forms an arrangement candidate group of such nodes. In the first iteration of loop1, nodes that have no predecessors are selected to form the arrangement candidate group.

Steps S3-S8 form a loop (loop2) that determines which nodes in the arrangement candidate group formed in step S2 should be assigned to a same parallel execution code. This loop process can end due to either of two circumstances. The first circumstance is when all of the arrangement candidates in the arrangement candidate group have been arranged into a parallel execution code so that no assignment candidates remain. This corresponds in the branch-cutting game to a case where there are few end branches (which is to say, there are few arrangement candidates). There are cases where no assignment candidates remain after only one or two iterations of loop2. In such cases, loop2 ends due to this first circumstance.

The second circumstance is where the four unit fields in the present parallel execution code have been filled with arrangement candidates, so that there is no more room in the parallel execution code. In this second circumstance, some of the arrangement candidates in the arrangement candidate group cannot be arranged into the parallel execution code and so are left behind.

In step S9, the nodes that are to be arranged into the parallel execution code are determined, regardless of which of the two circumstances resulted in the exit from loop2. In detail, the instructions that correspond to the nodes in the arrangement candidate group are extracted from the original instruction sequence and parallel execution boundaries are added by the parallel execution boundary appending unit 122 shown in FIG. 21. When only one short instruction is determined as being arranged into the parallel execution code, in step S9 a parallel execution boundary is set for this short instruction. By doing so, the parallel execution code is set as having a data length of 21 bits. When one long instruction is determined as being arranged into the parallel execution code, in step S9 a parallel execution boundary is set for this long instruction. By doing so, the parallel execution code is set as having a data length of 42 bits. In the same way, when a combination of one short and one long instruction is determined as being arranged into the parallel execution code, in step S9 a parallel execution boundary is set for the long instruction in the combination. By doing so, the parallel execution code is set as having a data length of 63 bits.

When a short-short-long instruction combination is determined as being arranged into the parallel execution code, in step S9 a parallel execution boundary is set for the long instruction in the combination. By doing so, the parallel execution code is set as having a data length of 84 bits.

In step S1, variable i is incremented by "1" so as to make it indicate the next parallel execution code into which instructions are to be arranged. The processing then returns to step S10.

When the processing moves to step S2 in a second or later iteration of loop1, the provisional arrangement of one of the instructions will have been completed. As a result, a node that has the provisionally arranged instruction as a predecessor can hereafter be selected as part of the arrangement candidate group.

When loop2 ends due to the second circumstance, the nodes that were not arranged and so were left behind are also selected as arrangement candidates. This shows that the nodes in the dependency graph that are selected as arrangement candidates change according to which nodes have been provisionally arranged into a parallel execution code and to which nodes could not be provisionally arranged into the parallel execution code and so were left behind.

In loop2, the instruction rearranging unit 121 performs the processing described below (steps S4-S7) for each arrangement candidate in the arrangement candidate group.

Step S4 corresponds to the player of the branch-cutting game selecting an end branch to cut. In step S4, the node that is considered to be the most suitable for arranging at the present time is taken from the arrangement candidate group. The instruction rearranging unit 121 selects this most suitable node by heuristically selecting an instruction whose arrangement is believed to cause the greatest reduction in execution time for all instructions in the basic block. Here, a node situated at an end of the branch in the dependency graph with the longest total execution time is selected. When more than one node (instruction) satisfies this criterion, the instruction that comes first in the original instruction sequence is selected.

In step S5, the instruction rearranging unit 121 judges whether the most suitable node can be arranged into the present parallel execution code, according to the procedure shown in FIG. 23B. When this is not possible, the processing advances to step S8 so that the processing in steps S4-S7 will be performed for a different assignment candidate in the assignment candidate group.

When it is possible to arrange the most suitable node into the parallel execution code, the processing moves from step S5 to step S6. In step S6, the instruction rearranging unit 121 judges whether there is sufficient space in the 84-bit parallel execution code to arrange the present arrangement candidate. If not, the processing leaves loop2 and moves to step S9. If so, the judgement "Yes" is made in step S6 and the processing advances to step S7.

As a general rule, the processing in steps S4-S6 is repeated and the instructions are progressively assigned to parallel execution codes. It should be noted here that even if there is still space in a parallel execution code for the arrangement of another instruction, there will still be cases where no instruction will be arranged due to there being no more arrangement candidates. When there is only one assignment candidate, processing of all the assignment candidates will be completed by a single iteration of loop2, so that the processing will then move to step S9. However, if nodes could somehow be added as assignment candidates when the number of assignment candidates is low, further iterations of loop2 would be possible. Nodes that have an anti-dependence or an output dependence with the most suitable node are nodes that were not selected as arrangement candidates in step S2 but which may be later added as assignment candidates. Such nodes cannot be executed before the most suitable node, but can be executed in the same cycle as the most suitable node. As a result, when the judgement "Yes" is given in the flowchart in FIG. 23A, the processing moves to step S7 and nodes that have only the most suitable node that is presently being arranged as a predecessor and have an anti- or an output dependence with the most suitable node are added to the arrangement candidate group as arrangement candidates. After this, the processing moves to step S8 so that the processing in steps S4-S7 is performed for the newly added arrangement candidates.

The following describes the method used in step S5 to judge whether arrangement is possible, with reference to the flowchart shown in FIG. 23B.

In step U1, the instruction rearranging unit 121 checks whether the instructions included in the present parallel execution code satisfy the restrictions set by the number of calculating resources. In detail, the instruction rearranging unit 121 judges whether the processor will be able to simultaneously process the instruction being judged in addition to the instructions that have already been provisionally arranged into the parallel execution code. If not possible, the instruction rearranging unit 121 judges that the present instruction cannot be arranged into the parallel execution code.

Next, in step U2, the instruction rearranging unit 121 judges whether the number of instructions that have already been provisionally arranged into the present parallel execution code is less than the number of decoders in the processor minus one. If so, the instruction rearranging unit 121 judges that the present instruction can be arranged into the parallel execution code and the processing advances to step U9.
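
Steps U1 and U2 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the `RESOURCE_LIMITS` table takes the limits of three ALUs, one LD/ST unit, and one branch unit from restriction (2) given earlier, while the function names, the `resource` labels, the instruction names, and the representation of instructions as dictionaries are all hypothetical.

```python
from collections import Counter

# Limits from restriction (2) in the text; the key names are hypothetical.
RESOURCE_LIMITS = {"alu": 3, "ldst": 1, "branch": 1}
NUM_DECODERS = 3  # assumed here: a processor with three instruction decoders


def passes_u1(arranged, candidate):
    """Step U1: can the processor handle `candidate` together with the
    instructions already provisionally arranged into the code?"""
    used = Counter(instr["resource"] for instr in arranged)
    used[candidate["resource"]] += 1
    return all(used[res] <= limit for res, limit in RESOURCE_LIMITS.items())


def passes_u2(arranged, num_decoders=NUM_DECODERS):
    """Step U2: true when fewer than (number of decoders - 1) instructions
    have been provisionally arranged."""
    return len(arranged) < num_decoders - 1


if __name__ == "__main__":
    # Hypothetical instructions; which resource each uses is an assumption.
    arranged = [{"name": "load_a", "resource": "ldst"}]
    load_b = {"name": "load_b", "resource": "ldst"}
    add_c = {"name": "add_c", "resource": "alu"}
    print(passes_u1(arranged, load_b))  # would need a second LD/ST unit
    print(passes_u1(arranged, add_c))
    print(passes_u2(arranged))
```

The sketch covers only the resource check and the instruction-count check. It does not model instruction sizes or dependencies, which the remaining steps of FIG. 23B deal with.
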
Here, the number of decoders provided in the processor of the first embodiment is three, so that the judgement in step U2 is satisfied if 0 or 1 instructions have been provisionally arranged. When this is the case, the instruction presently being analyzed (also referred to as the "processed instruction") will definitely fit into the parallel execution code regardless of whether it is a short or long instruction, so that the processing proceeds to step U9.

When the number of instructions that have already been provisionally arranged into the present parallel execution code is not less than the number of decoders in the processor minus one, the judgement "No" is given in step U2 and the processing proceeds to step U3. In step U3, the number of instructions that have already been provisionally arranged is two, so that a judgement is performed to see whether both instructions are short instructions. Here, when two short instructions have already been arranged into the parallel execution code i, the processed instruction will definitely fit into the parallel execution code i regardless of whether it is a short instruction or a long instruction. This is because the target processor is capable of executing both short-short-short and short-short-long instruction combinations. Consequently, the processing advances to step U9.

In step U9, the processed instruction is provisionally arranged into the parallel execution code. When no instructions have yet been arranged into the parallel execution code i, the processed instruction is arranged into the first unit field in the parallel execution code. When instructions have been arranged into the first-third unit fields of the parallel execution code i, the processed instruction is arranged into the first open unit field in the parallel execution code i. In detail, when an instruction has already been arranged into the first unit field, the processed instruction is arranged into the second unit field. Conversely, when one or two instructions have already been arranged into the first and second unit fields, the processed instruction is arranged into the third unit field.

When the judgement in step U3 is negative, the processing advances to step U4. In step U4, the instruction rearranging unit 121 judges whether the instructions arranged into the first-third unit fields in the parallel execution code i are a short-long instruction combination or a long-short instruction combination. Here, if the provisionally arranged instructions are a long-long combination, it will not be possible for a further instruction to be executed in parallel, so that the arrangement of the processed instruction is judged to be impossible. Conversely, when the provisionally arranged instructions are one of the two combinations given above, the processing advances to step U5.

In step U5, the instruction rearranging unit 121 judges whether the processed instruction that it is trying to arrange is a short instruction. If the processed instruction is a long instruction, arrangement of this instruction will produce a long-short-long or short-long-long instruction combination in the parallel execution code i, neither of which can be executed by the target processor. Consequently, the instruction rearranging unit 121 judges that arrangement is impossible.

On finding that the processed instruction in step U5 is a short instruction, the instruction rearranging unit 121 uses the dependency graph in step U6 to analyze any dependencies between the processed instruction and instructions in the program that have already been provisionally arranged. Here, dependencies between the arrangement candidates are analyzed because arrangement candidates may have been added in step S7 in FIG. 23A. In detail, if the processed instruction is

this processed instruction will have an anti-dependence or output dependence with one or more of the provisionally arranged instructions. In the example shown in FIG. 22B, a broken-line edge is present between instruction3 "st R0,(mem2)" and instruction4 "mov R1,R0", showing that an anti-dependence exists between these instructions. In this dependency graph, there will be no problems if instruction3 "st R0,(mem2)" to instruction5 "mov R2,R3" are assigned to the unit fields of the parallel execution code i in the order instruction3 "st R0,(mem2)", instruction5 "mov R2,R3", instruction4 "mov R1,R0". This is because even if the circumstances of the target processor dictate that instruction3 "st R0,(mem2)" is executed in a different cycle to instruction5 "mov R2,R3" and instruction4 "mov R1,R0", instruction3 "st R0,(mem2)" will be executed first, with instruction5 "mov R2,R3" and instruction4 "mov R1,R0" being executed later. Consequently, the anti-dependence between the instructions is properly maintained.

If instruction3 "st R0,(mem2)" to instruction5 "mov R2,R3" are assigned to the unit fields of the parallel execution code i in the order instruction4 "mov R1,R0", instruction5 "mov R2,R3", instruction3 "st R0,(mem2)", however, there is the risk that the anti-dependence will be broken. This is because the circumstances of the target processor may dictate that instruction4 "mov R1,R0" is executed in a different cycle to instruction5 "mov R2,R3" and instruction3 "st R0,(mem2)". If so, instruction4 "mov R1,R0" will be executed first, with instruction5 "mov R2,R3" and instruction3 "st R0,(mem2)" being executed later. This results in the anti-dependence being broken. In this way, when two arrangement candidates that exhibit dependency are arranged into the same parallel execution code, there is the risk of an anti-dependence being broken, so that the analysis of dependencies in step U6 is required.

In step U7, the instruction rearranging unit 121 refers to the results of the analysis performed in step U6 and judges whether it is possible to rearrange the instructions that have been provisionally arranged and the processed instruction to produce a short-short-long instruction arrangement. When there is no anti-dependence or output dependence in the program between the processed instruction and the provisionally arranged instructions, these instructions may be rearranged to produce a short-short-long instruction arrangement, so that the instruction rearranging unit 121 rearranges the instructions in this way. Conversely, when there is anti-dependence or output dependence in the program between the processed instruction and the provisionally arranged instructions, a short-short-long arrangement where the anti- or output dependence is not broken is selected. If the anti- or output dependence is broken regardless of how the short instructions are arranged, arrangement of the processed instruction in the present parallel execution code is judged to be impossible. If there is an arrangement where the dependency is not broken, the instructions are rearranged in accordance with such arrangement.

Step U8 is performed if the judgement in step U7 is affirmative. The instruction rearranging unit 121 arranges the processed instruction and rearranges the provisionally arranged instructions into the alignment that satisfies the criteria judged in step U7.

Object Code Generating Unit 113

The following explanation returns to FIG. 21 to describe the components of the instruction conversion apparatus. The object code generating unit 113 divides the parallel execution codes, which have been assigned instructions and given parallel execution boundaries by the instruction scheduling

a node that was added in step S7, there is a possibility that unit 112, into packet units. The packet sequence that is made 
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up of the packets produced by this division are then stored 
in a file as relocatable object codes and the resulting file is 
output ted. 
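
The arrangement test that the instruction rearranging unit 121 performs in steps U4 to U8 can be sketched as follows. This is only an illustrative sketch: the `Insn` type, the function name `try_arrange`, and the `must_precede` set (pairs of instruction names whose relative order must be kept, standing in for the anti- and output dependencies found in step U6) are assumptions introduced here and do not appear in the specification.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Insn:
    name: str
    is_long: bool  # True: two-unit (long) instruction, False: one-unit (short)


def try_arrange(provisional, processed, must_precede=frozenset()):
    """Return a short-short-long ordering, or None if arrangement is impossible.

    provisional  -- the two instructions already provisionally arranged
    processed    -- the instruction being arranged
    must_precede -- hypothetical set of (a, b) name pairs meaning that a must
                    stay before b (models anti-/output dependencies, step U6)
    """
    kinds = sorted(i.is_long for i in provisional)
    # Step U4: only a short-long or long-short pair can take a third instruction.
    if len(provisional) != 2 or kinds != [False, True]:
        return None
    # Step U5: a long instruction would give long-short-long or short-long-long.
    if processed.is_long:
        return None
    long_insn = next(i for i in provisional if i.is_long)
    shorts = [i for i in provisional if not i.is_long] + [processed]
    # Steps U7-U8: try both orders of the short instructions, long one last.
    for a, b in permutations(shorts):
        order = [a, b, long_insn]
        pos = {insn.name: k for k, insn in enumerate(order)}
        if all(pos[x] < pos[y] for x, y in must_precede
               if x in pos and y in pos):
            return order
    return None


# A long instruction "1" and a short instruction "3" are already arranged;
# the short instruction "4" is then processed with no dependencies.
n1, n3, n4 = Insn("1", True), Insn("3", False), Insn("4", False)
print([i.name for i in try_arrange([n1, n3], n4)])  # ['3', '4', '1']
```

Combinations other than short-long and long-short are rejected here for simplicity; in the flow described above they are handled by the earlier steps.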
Linking Unit 114

The linking unit 114 links a plurality of relocatable object
codes that were generated in different compiling units to
produce one linked sequence, refers to symbol information
and calculates the final address of each label, and determines
the size of each label. The symbol information referred to
here is information showing the actual address of the parallel
execution code to which each label in the object code is
assigned.

The linking unit 114 of the present invention differs from
a conventional linker by including an address resolving unit
123. The address resolving unit 123 resolves addresses in
object code that include unresolved addresses and can be
realized by software that executes the procedure shown in
FIG. 24.

FIG. 24 is a flowchart showing the procedure executed by
the address resolving unit 123 which forms part of the
linking unit 114.

In step V0, the address resolving unit 123 extracts all
instructions (hereafter called "unresolved instructions") that
include an unresolved label from the object codes that have
been assigned addresses. Step V10 is a loop statement for
having the processing in step V1-step V9 repeated for each
instruction extracted in step V0. In step V1, the address
resolving unit 123 refers to the symbol information and
calculates a displacement to the branch or reference
destination from the address of the unresolved instruction.
When the address of the unresolved instruction is close to the
branch or reference destination, a small value will be given
as the displacement, while when the address of the unresolved
instruction is far from the branch or reference destination, a
large value will be given as the displacement.

Once the displacement has been calculated, the processing
advances to step V2, where the address resolving unit 123
judges whether the displacement can be expressed by a 5-bit
value. If so, the processing advances to step V3.

When the assembler codes are rearranged, instructions
that include displacements are regarded as short instructions
and are arranged into parallel execution codes as such. When
the displacement can be expressed by a 5-bit value, the
displacement can be written into the operand of a short
instruction without causing any problems. As a result, the
determined displacement is written into the unresolved
instruction, thereby completing the processing of the present
unresolved instruction.

On the other hand, when the determined displacement
cannot be expressed by a 5-bit value, the displacement
cannot be written into the operand of a short instruction. As
a result, the judgement "Yes" is given in step V2 and the
processing proceeds to step V4. In step V4, the address
resolving unit 123 judges whether the displacement cannot
be expressed by a 21-bit value. If not, the judgement "No"
is given and the processing advances to step V5. In other
words, the displacement can be written as an operand if the
unresolved instruction is converted to a long instruction, so
that in step V5, the instruction size of the unresolved
instruction is increased to make the unresolved instruction a
long instruction, and the displacement is written in the long
instruction as a 21-bit value. Note that there can be cases
where this extension of an unresolved instruction results in
the parallel execution code including the unresolved
instruction violating the restrictions governing the possible
combinations of instructions in a parallel execution code,
meaning that simultaneous execution will no longer be possible
for the instruction in the parallel execution code. As a result,
once an unresolved instruction has been extended to become
a long instruction, step V9 judges whether the parallel
execution code still satisfies one of the patterns in A~H
shown in FIG. 7. If this is not the case, the processing
proceeds to step V6 where a parallel execution boundary is
inserted before or after the unresolved instruction to ensure
that parallel execution will still be possible.

When the calculated displacement cannot be expressed by
a 21-bit value, the judgement "Yes" is given in step V4 and
the processing proceeds to step V7. When the calculated
displacement exceeds 21 bits, the displacement cannot be
written even if the unresolved instruction is expanded to
become a long instruction. In this case, the unresolved
instruction is processed by replacing it with a long
instruction (1) and a short instruction (2). The processing
content of these instructions is as follows.

Long instruction (1): transfer instruction that transfers an
address into a register.

Short instruction (2): instruction that executes the same
processing as the unresolved instruction in addressing mode
using the register into which the address has been
transferred.

The register that is used in addressing mode is specially
reserved for this division of instructions.

The long instruction (1) and the short instruction (2) that
replace the unresolved instruction in step V7 have a data
dependency over the register, meaning that these instructions
cannot be executed simultaneously. Consequently, step V8
inserts a parallel execution boundary between the long
instruction (1) and the short instruction (2).

As a result of the above processing, even if the
determination of an unresolved address in the linking process
results in a change in the length of instructions, it is still
guaranteed that parallel execution codes which can be executed
by the target processor will be outputted.

As described above, when three instructions to be
executed in parallel are composed of two short instructions and
one long instruction, the instruction conversion apparatus of
the present invention rearranges the instructions into a
short-short-long instruction pattern. Since both short
instructions and long instructions have their opcodes located
in the first instruction unit, the above instruction pattern
has all opcodes arranged in the first three instruction units.
In such case, the decoders of the target processor can decode
the first three units in a parallel execution code and so have
the processor execute the maximum of three instructions in
parallel.

Supplementary Explanation for the First Embodiment
Operation of the Processor

The following describes the operation of the processor of
the first embodiment when decoding and executing specific
instructions.

FIG. 25 is a flowchart showing an example of a process
that handles a 32-bit constant.

In FIG. 25, the 32-bit constant "0x87654321" is
transferred into register R1 (step S100). The stored value of
register R5 is transferred to register R0 (step S101). The
stored value of register R0 is added to the stored value of
register R1 (step S102). The stored value of register R3 is
added to the stored value of register R2 (step S103). The
stored value of register R0 is stored at the address in the
memory shown by the stored value of register R4 (step
S104). The stored value of register R0 is transferred to
register R6 (step S105). Finally, the stored value of register
R3 is transferred to register R7 (step S106).
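
To make the data flow of steps S100 to S106 concrete, the process of FIG. 25 can be traced with a small Python sketch. The dictionary-based register file and memory, the starting register values, and the 32-bit wraparound on addition are assumptions chosen only for illustration. The destinations of the two additions (R0 and R3) follow the description of execution unit 2 in this specification.

```python
MASK32 = 0xFFFFFFFF  # assumption: registers are 32 bits wide


def run_fig25(regs, mem):
    """Apply steps S100-S106 to a register dict and a memory dict."""
    regs["R1"] = 0x87654321                          # S100: constant -> R1
    regs["R0"] = regs["R5"]                          # S101: R5 -> R0
    regs["R0"] = (regs["R0"] + regs["R1"]) & MASK32  # S102: result kept in R0
    regs["R3"] = (regs["R3"] + regs["R2"]) & MASK32  # S103: result kept in R3
    mem[regs["R4"]] = regs["R0"]                     # S104: store R0 at (R4)
    regs["R6"] = regs["R0"]                          # S105: R0 -> R6
    regs["R7"] = regs["R3"]                          # S106: R3 -> R7
    return regs, mem


# Arbitrary starting values, used only to exercise the sketch.
regs = {f"R{n}": 0 for n in range(8)}
regs.update(R2=10, R3=20, R4=0x100, R5=1)
regs, mem = run_fig25(regs, {})
print(hex(regs["R0"]), hex(mem[0x100]), regs["R7"])  # 0x87654322 0x87654322 30
```

Steps S100 and S101 are independent of each other, as are S102 and S103, which is what allows them to share an execution unit.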

FIG. 26A shows an example of the executable codes in a
program that has the present processor execute the
processing shown in FIG. 25, and FIG. 26B shows an execution
image.
The program is composed of seven instructions. These
instructions are supplied in the three packets 70-72. The
processing in each instruction is expressed by the mnemon- 
ics located in each field of the executable codes. As specific 
examples, the mnemonic "mov" represents the transfer of a 
constant or the stored value of a register into a register, the 
mnemonic "add" represents the addition of a constant or the 
stored value of a register to the stored value of a register, and 
the mnemonic "st" represents the transfer of the stored value 
of a register into memory. 

Note that constants are expressed in hexadecimal. Also,
the expression "Rn (n=0~31)" indicates one of the registers
in the register file 43. The parallel execution boundary
information f10 and the format information f11 are each
expressed as "1" or "0".

The following describes the operation of the processor for 
each execution unit shown in FIG. 26B when processing 
according to the flowchart shown in FIG. 25. 
Execution Unit 1 

Packet 70 is supplied from the memory, and the units in
packet 70 are transferred to the instruction register 23 in
order. After this, the instruction issuing control unit 31 refers
to the parallel execution boundary information f10 and
format information f11 of each unit and controls the issuing
of instructions. In detail, the format information f11 of the
first unit is "1", so that the instruction issuing control unit 31
links the first unit and second unit and treats them as one
instruction. The no operation instruction flag of the second
instruction decoder 34 is set at "1", and the decoding of the
second unit as an instruction is invalidated. The parallel
execution boundary information f10 of the first unit is "0",
and the parallel execution boundary information f10 of the
third unit is "1", so that the instruction issuing control unit
31 issues the first-third units as two instructions. Since all
of the supplied units are issued, no units are accumulated in
the instruction buffer 22.

The executing unit 40 transfers the constant
"0x87654321" to register R1 and transfers the stored value
of register R5 to register R0.
Execution Unit 2 

Packet 71 is supplied from memory, and the units in
packet 71 are transferred to the instruction register 23 in
order. The format information f11 of all three units is "0", so
that each unit forms a 21-bit instruction. The parallel
execution boundary information f10 of the first unit is "0",
and the parallel execution boundary information f10 of the
second unit is "1", so that the instruction issuing control
unit 31 issues the first and second units as two instructions.
The third unit is not issued and so is accumulated in the
instruction buffer 22.

The executing unit 40 adds the stored value of register R0
to the stored value of register R1 and stores the result in
register R0. The executing unit 40 also adds the stored value
of register R3 to the stored value of register R2 and stores
the result in register R3.
Execution Unit 3 

Packet 72 is supplied from memory, and one unit
accumulated in the instruction buffer 22 and the two units in
packet 72 are transferred to the instruction register 23 in
order. The format information f11 of all three units is "0", so
that each unit forms a 21-bit instruction. The parallel
execution boundary information f10 of the first unit and the
second unit is "0", and the parallel execution boundary
information f10 of the third unit is "1", so that the
instruction issuing control unit 31 issues all three units as
three separate instructions. In this case, all of the supplied
units are issued as instructions.

The executing unit 40 transfers the stored value of register
R0 to the address in the memory shown by the stored value
of register R4, transfers the stored value of register R0 to
register R6, and transfers the stored value of register R3 to
register R7.
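
The issuing behavior walked through for execution units 1 to 3 can be sketched in Python as follows. This is a simplified model rather than the actual logic of the instruction issuing control unit 31: the `Unit` tuple, the functions `issue` and `run`, the mnemonic operand order, and the f10/f11 values given to the operand unit and to the unused last unit are all assumptions made for illustration.

```python
from collections import namedtuple

# f11: format information (1 = this unit and the next form one instruction)
# f10: parallel execution boundary information (1 = last instruction issued
#      in this cycle)
Unit = namedtuple("Unit", "text f11 f10")


def issue(window):
    """Split up to three units into (issued instructions, leftover units)."""
    issued, i = [], 0
    while i < len(window):
        unit = window[i]
        width = 2 if unit.f11 == 1 else 1
        if i + width > len(window):
            break  # second half not yet supplied; leave the unit buffered
        issued.append(unit.text)
        i += width
        if unit.f10 == 1:
            break  # parallel execution boundary reached
    return issued, window[i:]


def run(packets):
    """Feed packets one per cycle, carrying unissued units in a buffer."""
    buffer, cycles = [], []
    for packet in packets:
        stream = buffer + list(packet)
        window, rest = stream[:3], stream[3:]
        issued, leftover = issue(window)
        buffer = leftover + rest
        cycles.append(issued)
    return cycles


p70 = [Unit("mov 0x87654321,R1", 1, 0), Unit("(operand)", 0, 0),
       Unit("mov R5,R0", 0, 1)]
p71 = [Unit("add R1,R0", 0, 0), Unit("add R2,R3", 0, 1),
       Unit("st R0,(R4)", 0, 0)]
p72 = [Unit("mov R0,R6", 0, 0), Unit("mov R3,R7", 0, 1),
       Unit("(unused)", 0, 0)]

for n, insns in enumerate(run([p70, p71, p72]), start=1):
    print(n, insns)
```

With these inputs the sketch issues two, two, and three instructions in the three cycles, matching the walkthrough above.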

As described above, the program has the present processor
execute the processing shown in FIG. 25 in three execution
units. The executable codes are composed of one 42-bit
instruction and six 21-bit instructions, so that the total
code size is 168 bits.

Supplementary Explanation for the Instruction Conversion 
Apparatus of the Second Embodiment 
First Specific Example of the Operation of the Instruction 
Conversion Apparatus 

The following describes the operation of the characteristic
components of the present instruction conversion apparatus,
with reference to specific instructions.

FIG. 27A shows assembler codes that are generated by the
assembler code generating unit 111 when source codes are
inputted into the compiler upstream part 110. The instruction
scheduling unit 112 receives an input of the codes shown in
FIG. 27A. The meaning of each instruction shown in FIG. 27A
is as follows.

Instruction 1: the constant 0x1000 ("0x" showing that the
value is in hexadecimal) is transferred to the register R0.

Instruction 2: the content of register R0 is stored in the
memory address indicated by the stack pointer SP.

Instruction 3: the content of register R1 is transferred to
register R2.

Instruction 4: the content of register R3 is transferred to
register R4.

Instruction 5: the content of register R2 is added to
register R4.

The following explains the operation of the instruction
scheduling unit 112 with reference to FIGS. 27B-27E. First,
the dependency analyzing unit 120 is activated and the
dependency graph shown in FIG. 27B is generated from the
codes shown in FIG. 27A. Next, the instruction rearranging
unit 121 is activated. When loop2 composed of steps S3-S8
ends, the processing moves to step S9 where the instruction
rearranging unit 121 determines a group including one or
more instructions as the arranged nodes. The unit for
determining such groups is called a "cycle".
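
As an illustration of what the dependency analyzing unit 120 produces for instructions 1 to 5, the following Python sketch derives dependencies from per-instruction read and write sets. The read and write sets are taken from the instruction meanings listed above; the data layout, the function names `build_graph` and `candidates`, and the decision to ignore dependencies through memory are assumptions of this sketch, not the procedure of the specification.

```python
# Registers read and written by instructions 1-5 of FIG. 27A.
INSNS = {
    1: {"reads": set(),        "writes": {"R0"}},  # 0x1000 -> R0
    2: {"reads": {"R0", "SP"}, "writes": set()},   # R0 -> memory at (SP)
    3: {"reads": {"R1"},       "writes": {"R2"}},  # R1 -> R2
    4: {"reads": {"R3"},       "writes": {"R4"}},  # R3 -> R4
    5: {"reads": {"R2", "R4"}, "writes": {"R4"}},  # R4 + R2 -> R4
}


def build_graph(insns):
    """Return {later: {(earlier, kind), ...}} for every dependent pair."""
    preds = {n: set() for n in insns}
    order = sorted(insns)
    for idx, later in enumerate(order):
        for earlier in order[:idx]:
            a, b = insns[earlier], insns[later]
            if a["writes"] & b["reads"]:
                preds[later].add((earlier, "data"))
            if a["reads"] & b["writes"]:
                preds[later].add((earlier, "anti"))
            if a["writes"] & b["writes"]:
                preds[later].add((earlier, "output"))
    return preds


def candidates(preds, arranged):
    """Nodes not yet arranged whose predecessors have all been arranged."""
    return [n for n in sorted(preds)
            if n not in arranged and all(p in arranged for p, _ in preds[n])]


graph = build_graph(INSNS)
print(candidates(graph, set()))      # [1, 3, 4]
print(candidates(graph, {1, 3, 4}))  # [2, 5]
```

The first call gives the nodes that have no predecessors, and the second gives the nodes that become available once those have been arranged.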
First Cycle 

First, the arrangement candidate group is selected (step
S2). At this point, the nodes with no predecessors are nodes
1, 3, and 4. Next, the most suitable node is selected (step S4).
In this example, node 1 is selected. Next, it is judged
whether node 1 can be arranged (step S5). In this example,
arrangement of node 1 is judged possible (steps U1, U2), so
that node 1 is provisionally arranged (step U9).

At this point, the parallel execution code is as shown on
the top level of FIG. 27C. Next, the arrangement state is
judged (step S6). Since the parallel execution code at this
point is as shown on the top level of FIG. 27C, further
arrangement is judged as being possible. Since no new
arrangement candidates are generated (step S7), the
processing returns to the start of loop2 (step S8). Since there
are still nodes remaining in the arrangement candidate group,
loop2 is repeated (step S3). Next, the most suitable node is
selected (step S4). In this example, node 3 is selected. Next,
it is judged whether node 3 can be arranged (step S5). In this
example, arrangement of node 3 is judged possible (steps
U1, U2), so that node 3 is provisionally arranged (step U9).

At this point, the parallel execution code is as shown on
the second level of FIG. 27C. Next, the arrangement state is
judged (step S6). Since the parallel execution code at this
point is as shown on the second level of FIG. 27C, further
arrangement is judged as being possible. Since no new
arrangement candidates are generated (step S7), the
processing returns to the start of loop2 (step S8). Since there
are still nodes remaining in the arrangement candidate group,
loop2 is repeated (step S3). Next, the most suitable node is
selected (step S4). In this example, only node 4 is left, so
this is selected. Next, it is judged whether node 4 can be
arranged (step S5). In this example, the present parallel
execution code is as shown on the second level of FIG. 27C
with two instructions having been provisionally arranged in a
long-short pattern. As a result, the processing advances to
step U5 via steps U1-U4. The present processed instruction is
a short instruction, so that the judgement "Yes" is given in
step U5 and the processing advances to step U6.

In step U6, dependencies between the provisionally
arranged instructions (nodes 1 and 3) and the processed
instruction (node 4) are investigated. As can be understood
from the dependency graph, no dependency exists between
these instructions, so that instructions 1, 3, and 4 may be
executed in any order. As a result, the judgement "Yes" is
given in step U7, and the instructions in the present parallel
execution code are rearranged into the order 3, 4, 1 in step
U8. The arranged state is then examined (step S6). At this
point, the parallel execution code is as shown by the third
level in FIG. 27C, and since the number of provisionally
assigned instructions has reached three, the maximum number
of instructions that can be executed in parallel by the
processor of the first embodiment, assignment of further
instructions is judged to be impossible. Accordingly, loop2
ends and the processing moves to step S9. In step S9, the
instructions that have been provisionally arranged are
confirmed as being arranged into the present parallel execution
code. At this point, the processing of the first cycle is
complete. Since unassigned nodes remain, however, loop1
is repeated (steps S10, S1).

Second Cycle

First, the arrangement candidate group is selected (step
S2). At this point, the nodes with no predecessors, nodes 2
and 5, are set as the selection candidates. The following
processing is the same as in the first cycle and so will not be
explained. This processing in the second cycle results in
these two nodes being arranged as arranged instructions.

Next, the instruction rearranging unit 121 inserts a
parallel execution boundary at the first instruction of each
cycle. After these parallel execution boundaries have been
inserted, the codes are as shown in FIG. 27D.

After this, the object code generating unit 113 is activated.
In the present example, the codes shown in FIG. 27D are
outputted as the object file.

Finally, the linking unit 114 is activated. Since address
resolution is not required for the codes shown in FIG. 27D,
the final executable codes are obtained via the same
processing as a conventional linker. An image of the executable
codes is shown in FIG. 27E. The actual executable codes are
bit sequences that have been divided into 64-bit units.

FIG. 28A shows assembler codes that are generated by the
assembler code generating unit 111 when source codes are
inputted into the compiler upstream part 110. The instruction
scheduling unit 112 receives an input of the codes shown in
FIG. 28A. The meaning of each instruction shown in FIG.
28A is as follows.

Instruction 6: the content of the memory indicated by the
label "mem1" is loaded into the register R0.

Instruction 7: the content of register R0 is stored in the
memory address indicated by the stack pointer SP.

Instruction 8: the content of register R1 is transferred to
register R2.

Instruction 9: the content of register R3 is transferred to
register R4.

Instruction 10: the content of register R2 is added to
register R4.

First, the dependency analyzing unit 120 is activated and
the dependency graph shown in FIG. 28B is generated from
the codes shown in FIG. 28A. Next, the instruction
rearranging unit 121 and the parallel execution boundary
appending unit 122 are activated. The processing result for the
instruction scheduling unit 112 is transferred to the object
code generating unit 113 and the resulting code shown in FIG.
28C is outputted as the object file. This processing is the
same as in the first embodiment, so only the result is given.

Next, the linking unit 114 is activated. The codes shown
in FIG. 28C include an unresolved address, so that the
address resolving unit 123 in the linking unit 114 is
activated. First, in step V1, the address resolving unit 123
determines the address, so that the address "0xF000" is
determined as "mem1". Since "0xF000" is a value that
exceeds 21 bits, the judgement "Yes" is given in both step
V2 and step V4, so that the processing advances to step V7.
In step V7, the instruction "ld (mem1),R0" is divided into the
instructions "mov mem1,R31" and "ld (R31),R0". In this
example, register R31 is the register that is reserved for use
when the instruction conversion apparatus divides
instructions. Here, the reason the instruction "ld (mem1),R0"
is divided is that the only instructions of the processor that
can handle a 32-bit value are transfer instructions that
transfer a value to a register, with there being no load
instruction that can directly handle a 32-bit address. Next, in
step V8, a parallel execution boundary is inserted between the
instructions "mov mem1,R31" and "ld (R31),R0". This results in
the final executable codes being as shown in FIG. 28D.

Comparison with a Conventional Fixed-Length VLIW Processor

The following compares, for the processing shown in
FIG. 25, the operation of the present processor to the
operation of a VLIW processor that uses fixed-length
instructions as one example of the conventional art.

For a simple VLIW processor that issues a fixed number
of instructions with a fixed instruction length in each cycle,
the setting of instruction length at a suitable value for the
transfer of a 32-bit constant to be indicated by one
instruction will result in an extremely large increase in
overall code size. As a result, instruction length is set at
32 bits, and the transfer of a 32-bit constant is performed by
dividing it into two transfer instructions that each transfer
16 bits.

FIGS. 29A and 29B show an example of the executable
codes in a program executed by a VLIW processor that
executes instructions of a fixed length of 32 bits and an
execution image.

The program is composed of four packets 73-76. As in
FIG. 26A, the processing content of each field is shown
using mnemonics. Here, however, the mnemonic "sethi"
refers to the storing of a 16-bit constant in the upper 16 bits
of a register and the mnemonic "setlo" refers to the storing
of a 16-bit constant in the lower 16 bits of a register. The
mnemonic "NOP" refers to an instruction with no operation
content.

As can be seen from comparing the executable codes in
FIG. 29A with the execution image in FIG. 29B, all
instructions supplied in one cycle are issued in the same cycle
under VLIW methods. In other words, three 32-bit
instructions are issued in each cycle. When no instructions
that can be executed in parallel exist, NOP instructions must
be inserted in advance by software. Four NOP instructions are
inserted in the present example, making a total of twelve
32-bit instructions and a total code size of 384 bits. This is
much larger than the code size of the code used by the
processor of the first embodiment.

Since the transfer of a 32-bit constant into a register is 
divided into two instructions, a new dependency is created, 
so that the number of execution units is increased to four. No 
matter how the instructions are rearranged, this number 
cannot be reduced. As a result, one more execution cycle is 
required than when the same processing is performed by the 
processor of the first embodiment. 
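
The statement that the number of execution units cannot be reduced below four can be checked by computing the longest chain of dependent instructions, since instructions on one chain must occupy different cycles. The sketch below is illustrative only: the function name `min_cycles`, the mnemonic spellings, and the dependency edges (in particular, that "setlo" depends on "sethi", which is how the new dependency is modelled here) are assumptions based on the description of FIG. 25.

```python
from functools import lru_cache


def min_cycles(deps):
    """Length of the longest dependency chain.

    deps maps each instruction to the instructions it depends on.
    """
    @lru_cache(maxsize=None)
    def depth(insn):
        return 1 + max((depth(p) for p in deps[insn]), default=0)
    return max(depth(insn) for insn in deps)


# Processor of the first embodiment: one long instruction loads the constant.
first_embodiment = {
    "mov imm32,R1": [],
    "mov R5,R0": [],
    "add R1,R0": ["mov imm32,R1", "mov R5,R0"],
    "add R2,R3": [],
    "st R0,(R4)": ["add R1,R0"],
    "mov R0,R6": ["add R1,R0"],
    "mov R3,R7": ["add R2,R3"],
}

# Fixed 32-bit instructions: the constant is loaded by sethi then setlo.
fixed_32bit = dict(first_embodiment)
del fixed_32bit["mov imm32,R1"]
fixed_32bit["sethi hi16,R1"] = []
fixed_32bit["setlo lo16,R1"] = ["sethi hi16,R1"]
fixed_32bit["add R1,R0"] = ["setlo lo16,R1", "mov R5,R0"]

print(min_cycles(first_embodiment))  # 3
print(min_cycles(fixed_32bit))       # 4
```

The chain sethi, setlo, add, st has four links, so no rearrangement of the fixed-length code can finish in fewer than four cycles under these assumed dependencies.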

Comparison With a Conventional Processor Where Parallel 
Execution Boundary Information is Present in Fixed-Length 
Instructions 

The following compares, for the processing shown in
FIG. 25, the operation of the present processor to the
operation of a processor with fixed-length instructions
including information showing whether there is a parallel
execution boundary as another example of the conventional
art.

This conventional art will be explained with reference to
a model that executes 32-bit instructions and a model that
executes 40-bit instructions. Like the VLIW method shown
in FIG. 29, the model that executes 32-bit instructions
performs the transfer of a 32-bit constant using two
instructions. However, the model that executes 40-bit
instructions can perform operations including the transfer of
a 32-bit value into a register using only one instruction.

FIGS. 30A and 30B show an example of the executable
codes and an execution image for a program executed by a
processor that executes instructions which have a fixed
length of 32 bits and include parallel execution boundary
information.

The program is composed of eight instructions that are
supplied as the three packets 77-79. The processing in each
instruction is shown by the mnemonics that have been
placed into each field of the executable codes. As in the
VLIW method with 32-bit instructions that was shown in
FIG. 29, the transfer of a 32-bit constant into a register is
performed in 16-bit units by two instructions.

As can be seen from FIGS. 30A and 30B, the transfer of
a 32-bit constant into a register is performed in 16-bit units
by two instructions, which, as with the VLIW method of
FIG. 29, generates a new dependency. This means that one
more execution cycle is required than when the processor of
the first embodiment is used.

Since no NOP instructions need to be inserted, the code
size is equal to that of the VLIW method shown in FIG. 29
minus the code size attributable to the NOP instructions.
This means that eight 32-bit instructions are used, making
the total code size 256 bits. However, this is still larger than
the code size of the code used by the processor of the first
embodiment.

The following compares the processor of the first embodi- 
ment to a model that uses instructions of a fixed length of 40 
bits. 

FIGS. 31A and 31B show an example of the executable
codes and an execution image for a program executed by a
processor that executes instructions which have a fixed
length of 40 bits and include parallel execution boundary
information.

The program is composed of seven instructions that are
supplied as the three packets 80-82. The processing in each
instruction is shown by the mnemonics that have been
placed into each field of the executable codes. Here, the
transfer of a 32-bit constant into a register can be performed
by one instruction.

As can be seen from FIGS. 31A and 31B, the transfer of
a 32-bit constant into a register is performed by one
instruction. This means that a total of three execution cycles
are required, which is the same as when the processor of the
first embodiment is used.

While this conventional art uses the same number of
instructions as the processor of the first embodiment, the
conventional processor has an instruction length of 40 bits
which is used for all instructions. The processor of the first
embodiment has instructions that do not require a large
number of bits defined as 21-bit instructions. The program
for the conventional processor is composed of seven 40-bit
instructions, giving a total code size of 280 bits. This is
larger than the code used by the processor of the first
embodiment.
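
The code sizes quoted in the comparisons above follow directly from the instruction counts and instruction lengths. The short sketch below recomputes them; the function name `code_size` and the dictionary labels are introduced only for illustration.

```python
def code_size(counts):
    """Total bits for an iterable of (number_of_instructions, bits_each)."""
    return sum(n * bits for n, bits in counts)


sizes = {
    "first embodiment (one 42-bit, six 21-bit)": code_size([(1, 42), (6, 21)]),
    "VLIW, fixed 32-bit, including four NOPs": code_size([(12, 32)]),
    "fixed 32-bit with boundary information": code_size([(8, 32)]),
    "fixed 40-bit with boundary information": code_size([(7, 40)]),
}

for label, bits in sizes.items():
    print(f"{label}: {bits} bits")
```

This gives 168, 384, 256, and 280 bits respectively, in agreement with the figures stated in the text.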

The processor of the present embodiment has been described
above by way of embodiments, although the processor should not
be construed as being limited to these embodiments. Several
example modifications are given below.

(1) The above embodiments use a premise that scheduling
is performed statically, although this is not a limitation
for the present invention. In other words, the present
invention can also be adopted by a processor that
dynamically schedules instructions, such as a supersca- 
lar processor. When doing so, parallel execution bound- 
ary information is not provided in the instructions, and 
the decoder is provided with a parallel execution inves- 
tigating apparatus for dynamically investigating 
whether instructions can be executed in parallel. The 
control in the above embodiments that was performed 
by the instruction issuing control unit referring to the 
parallel execution boundary information can be per- 
formed by referring to the output of the parallel execu- 
tion investigating apparatus. Such a construction 
reduces the amount of hardware used by a processor 
executing variable length instructions, thereby main- 
taining the effect of the present invention. 

(2) The above embodiments describe the case where a 
maximum of three instructions are executed 
simultaneously, although the present invention is not 
limited to this number. As one example, a construction 
where two instructions are simultaneously issued may 
be used. When doing so, suitable changes only need to 
be made to the construction of the decoding unit and 
periphery of the instruction register, and to the calcu- 
lators in the executing unit. 

(3) As can be seen from the instruction formats given in 
FIGS. 6A-6F, the above embodiments handle instruc- 
tions that are composed of one or two units. However, 
this is not a restriction for the present invention, so that 
instruction formats where three or more units are linked 
to form one instruction may also be defined. As one 
example, when instructions are composed of up to four 
instruction units, two bits can be used as the format 
information of each instruction. 

(4) As can be seen from the instruction formats given in 
FIGS. 6A-6F, the above embodiments handle instruc- 
tions that are composed of one or two units. However, 
instructions composed of a single unit do not need to be 
used. As an alternative example, one instruction may be 
composed of two or three units. In such case, only the 
wiring between the instruction register, the instruction 
decoder, and the constant operand needs to be changed. 

(5) As can be seen from the instruction formats given in 
FIGS. 6A-6F, the instructions described in the above 
embodiments include information showing whether 
there is a parallel execution boundary. This information 
may not be provided, however. In such case, instruc- 



